Data Science as a Commodity: Use MADlib, R, & Other OSS Tools for Data Science (from Pivotal...
DESCRIPTION
Slides from the Pivotal Open Source Hub Meetup "Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!" As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of its practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos), along with their benefits and limitations, by sharing examples from Pivotal's data science engagements. By the end of this session we hope to have answered the questions: Where can I get started with data science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?

Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. in Biology with a specialization in Bioinformatics and a minor in French Literature from UCSD, and an M.S. and Ph.D. in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare, building models to derive insight and business value from their data.

TRANSCRIPT
BUILT FOR THE SPEED OF BUSINESS
© Copyright 2014 Pivotal. All rights reserved.
Data Science as a Commodity: How to use MADlib, R, and other Publicly Available and Open Source Tools for Data Science Pivotal OSS Meetups Sarah Aerni Pivotal Senior Data Scientist @itweetsarah [email protected]
January 28, 2014
What we will cover in today’s Meetup
• What is data science, big data, buzzword, buzzword?
• What are some examples of data science in action?
• What do I do at Pivotal?
• Who are our data scientists?
• Why is open source software important for data science?
• What tools does our team use? For NLP? For optimization? For regression?
• What do I do with loads of data?
• How can I create good models?
• What types of open source tools can I use to build models?
• How can I build a quick app?
• What can I do to get started analyzing text data?
• Which tools exist to create visualizations of my data that I can understand?
What we will not cover #notdatascience
Instead: Practical Data Science Tools #useful
http://blog.gopivotal.com/p-o-v/the-eightfold-path-of-data-science – Kaushik Das
Instead: Practical Data Science Tools #useful “At companies where there is no framework for operationalization of the models, PowerPoint is where models go to die!” – Hulya Farinas http://venturebeat.com/2013/12/03/how-to- revolutionize-healthcare-get-data-scientists-and-app-developers-together/
“The use of statistical and machine learning techniques on big multi-structured data — in a distributed computing environment — to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.” – Annika Jimenez http://blog.gopivotal.com/news-2/annika-jimenez-on-disruptive-data-science-at-the-strata-conference
DATA IS THE NEW CENTER OF GRAVITY
Data > Application
“BIG DATA IS THE NEW NORMAL”
“‘BIG DATA’ BECOMES ‘DATA’ ONCE AGAIN”
What Can “Small Data” Scientists Bring on Their “Big Data” Journey?
http://factspy.net/the-difference-between-geeks-vs-nerds/
Small Data: flat files, in-memory model building, databases, command-line tools
Big Data: HDFS, distributed computing, cloud computing, MapReduce, command-line tools

Many tools and approaches are being adapted to big data technologies.
Basic DS Tools: From Command-line to GUI
Ian Huston, Alex Kagoshima, Ronert Obst
• Quick-and-dirty tricks using command-line tools
– Fast feedback (interactive)
– Fast to process
– Easy to write, hard to read
– Background processing (screen)
• Large volumes of data → automatically parallel environments (e.g. GPDB) may be faster
• Python and R
– RStudio
– IPython (IPython Notebook)
Favorite Python and R packages and resources

Python:
• NumPy
• SciPy
• scikit-learn (machine learning package)
• statsmodels
• pandas
• PyMC
• IPython (IPython Notebook)
• matplotlib
Ian Huston, Alex Kagoshima, Ronert Obst
Favorite Python and R packages, resources, and more

R:
• ggplot2
• reshape
• plyr
• Shiny
• Good support for time series analyses
• RStudio (weave)
• foreach, parallel
• CRAN task views
• parboost
Ian Huston, Alex Kagoshima, Ronert Obst
A New Platform for a New Era What do I do at Pivotal?
Cloud Fabric “The new OS”
Data Fabric “The new Database”
App Fabric “The new Middleware”
“The new Hardware”
DATA-DRIVEN APPLICATION DEVELOPMENT
Pivotal Big Data Technology: HAWQ
Think of it as multiple PostgreSQL servers
Segments/Workers
Master
Rows are distributed across segments by a particular field (or randomly)
Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
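The distribution policy above can be illustrated with a toy sketch in plain Python (this is not HAWQ's actual hash function, just the concept): rows are assigned to segments by hashing the distribution key modulo the number of segments.

```python
# Toy illustration of distributing rows across segments by a key.
# Real MPP databases use a fixed hash; Python's hash() stands in here.

NUM_SEGMENTS = 4

def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment id."""
    return hash(key) % num_segments

rows = [
    {"id": 1, "state": "CA", "price": 450_000},
    {"id": 2, "state": "TX", "price": 210_000},
    {"id": 3, "state": "CA", "price": 380_000},
]

segments = {i: [] for i in range(NUM_SEGMENTS)}
for row in rows:
    segments[segment_for(row["state"])].append(row)

# Rows sharing a key land on the same segment, which is what
# makes per-key operations local to one worker.
```

Distributing by a frequently joined or grouped column keeps those operations segment-local; distributing randomly balances the load instead.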
Performance Through Parallelism
• Automatic parallelization
– Load and query like any database
– Tables automatically distributed across nodes
• Analytics-oriented query optimization
• Scalable MPP architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
Data Science Tools for Big Data: Commercial and Open Source (or Free)
PL/R, PL/Python, PL/Java
Making sense of your “big data”
• Large volumes of data may be difficult to understand
– ~100 tables
– Tens of thousands of columns
• How do you build models that use all the data? Score all the data?
• Where do you focus your effort?
– Getting a rapid grasp of relevant fields is important
– Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data
– Columns with little or no variation, or only null values
• These functions exist in MADlib
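A minimal pure-Python sketch of the column profiling described above (MADlib's summary routines do this in-database at scale; the column names here are made up for illustration):

```python
# Profile columns to flag candidates for exclusion:
# all-null columns and columns with no variation.
# Toy in-memory data; in practice this scan runs in-database.

rows = [
    {"tax": 590,  "bath": 1, "region": "west", "unused": None},
    {"tax": 1050, "bath": 2, "region": "west", "unused": None},
    {"tax": 20,   "bath": 1, "region": "west", "unused": None},
]

def profile(rows):
    flagged = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        if not non_null:
            flagged[col] = "all null"
        elif len(set(non_null)) == 1:
            flagged[col] = "no variation"
    return flagged

# → {'region': 'no variation', 'unused': 'all null'}
```

Dropping such columns up front shrinks the feature space before any modeling begins.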
MADlib In-Database Functions: Predictive Modeling Library

Linear Systems
• Sparse and Dense Solvers

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Descriptive Statistics
• Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
• Correlation
• Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
MADlib in Action: Regression on Billions of Rows
(Photo: drilling into the San Andreas Fault at Parkfield, California. Credit: Stephen H. Hickman, USGS)
Rashmi Raghu
• Input Data
– Tens of millions of rows from data collected at multiple drill testing sites
– Sensor data for drills during operation, including rate of penetration, depth of penetration, weight on drill bit, and more
• Data Massaging and Review
– Rapid summarization of many columns of data to identify outliers and missing data and remove them from the analysis
– Used window functions to construct a moving average (smoothing) of all the features and the dependent variable
• Model
– Linear regression on the complete dataset
– K-means clustering to determine similarities of sites
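The moving-average smoothing step maps directly to a SQL window function (AVG(...) OVER (... ROWS k PRECEDING)); a plain-Python equivalent of that trailing window, on made-up readings, looks like:

```python
# Trailing moving average over sensor readings -- the same smoothing
# a SQL window function (AVG ... OVER ... ROWS k PRECEDING) performs.
from collections import deque

def moving_average(values, window=3):
    buf = deque(maxlen=window)   # trailing window of up to `window` rows
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

rate_of_penetration = [10.0, 12.0, 8.0, 14.0, 11.0]   # hypothetical readings
smoothed = moving_average(rate_of_penetration, window=3)
# ≈ [10.0, 11.0, 10.0, 11.33, 11.0]
```

In the database the same computation runs once per segment over distributed rows, so the smoothing scales with the cluster.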
Linear Regression: Streaming Algorithm
• Finding linear dependencies between variables
• How to compute with a single scan?
Linear Regression: Parallel Computation

Xᵀy = Σᵢ xᵢᵀyᵢ
Linear Regression: Parallel Computation

Each segment computes the product over its local rows; the master adds the partial results:

X₁ᵀy₁ + X₂ᵀy₂ = Xᵀy   (Segment 1 + Segment 2 → Master)
Performing a linear regression on 10 million rows in seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
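The per-segment accumulation sketched above can be written out in plain Python for the simple one-feature case (toy data; MADlib accumulates the full XᵀX and Xᵀy matrices the same way):

```python
# One-pass, partitioned computation of simple linear regression
# y ≈ b0 + b1*x via the normal equations. Each "segment" scans its
# partition once and emits partial sums; the master combines them.

segment1 = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (x, y) rows
segment2 = [(4.0, 8.1), (5.0, 9.8)]

def partial_sums(rows):
    # Local single-scan accumulation on one segment.
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return n, sx, sy, sxx, sxy

def combine(a, b):
    # The master simply adds the per-segment sums elementwise.
    return tuple(u + v for u, v in zip(a, b))

n, sx, sy, sxx, sxy = combine(partial_sums(segment1), partial_sums(segment2))

# Solve the 2x2 normal equations for intercept b0 and slope b1.
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n
# b1 ≈ 1.96, b0 ≈ 0.14
```

Because the combine step is just addition, adding segments speeds up the scan without changing the answer, which is the source of the linear scalability above.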
Calling MADlib Functions: Fast Training, Scoring
• MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
• All the data can be used in one model
• Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
• Open source lets you tweak and extend methods, or build your own

SELECT madlib.linregr_train(
    'houses',                       -- table containing training data
    'houses_linregr',               -- table in which to save results
    'price',                        -- column containing dependent variable
    'ARRAY[1, tax, bath, size]');   -- features included in the model
Adding a grouping column creates multiple output models (one for each value of bedroom):

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]',
    'bedroom');
Scoring new data against the saved model:

SELECT houses.*,
       madlib.linregr_predict(      -- MADlib model scoring function
           ARRAY[1, tax, bath, size],
           m.coef) AS predict
FROM houses,                        -- table with data to be scored
     houses_linregr m;              -- table containing the model
PivotalR: Bringing MADlib and HAWQ to a familiar R interface
• Challenge: harness the familiarity of R’s interface and the performance & scalability benefits of in-database analytics
• Simple solution: translate R code into SQL
Woo Jung

PivotalR:
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                            + bath
                            + size,
                            data = d)

SQL code generated:
SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');

http://gopivotal.github.io/PivotalR/
PivotalR also supports automated indicator variable coding a la as.factor():

d <- db.data.frame("houses")
# Build a regression model with a different
# intercept term for each state (state=1 as baseline).
houses_linregr <- madlib.lm(price ~ as.factor(state)
                            + tax
                            + bath
                            + size,
                            data = d)
PivotalR Design Overview
(Diagram: PivotalR translates R to SQL and sends it via RPostgreSQL to the database with MADlib; only computation results come back to R. No data lives in R; the data lives in the database.)
• Call MADlib’s in-DB machine learning functions directly from R
• Syntax is analogous to native R functions
• Data doesn’t need to leave the database
• All heavy lifting, including model estimation & computation, is done in the database
Woo Jung
http://gopivotal.github.io/PivotalR/
PivotalR: Current Features

MADlib Functionality
• Linear Regression
• Logistic Regression
• Elastic Net
• ARIMA
• Marginal Effects
• Cross Validation
• Bagging
• summary on model objects
• Automated indicator variable coding (as.factor)
• predict

• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview, sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects db.existsObject delete
• dim, names
• content
• And more ... (SQL wrapper)

http://gopivotal.github.io/PivotalR/
Shiny Showcase: Example Web Apps in R
• Users can choose input parameters with sliders, drop-downs, and text fields.
• HTML/JavaScript knowledge not required.
http://www.rstudio.com/shiny/
D3: Data-Driven Documents
http://d3js.org/
PyMADlib
• Python wrapper for MADlib
http://nbviewer.ipython.org/gist/vatsan/5275846
Procedural Languages in Big Data Science
• HAWQ & PL/X can take advantage of “data parallel” tasks by performing analyses in parallel (embarrassingly parallel tasks)
• Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
• Examples of ‘data parallel’ problems:
– Counting words in documents
– Genome-Wide Association Studies
– Studying network anomalies
(Diagram: a master server dispatches documents Doc1…DocM across segment servers over the network interconnect; each segment stems and counts its own documents in parallel, driven by SQL & R.)
http://gopivotal.github.io/gp-r/
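The word-counting example is easy to sketch as a data-parallel map/combine in plain Python (a thread pool stands in for the segment servers; in HAWQ the same shape runs as a PL/Python or PL/R function on each segment):

```python
# Data-parallel word counting: each worker counts its own partition
# of documents independently (no communication between workers), and
# the partial counts are merged at the end -- the "embarrassingly
# parallel" shape described above.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

partitions = [
    ["the cat sat", "the dog ran"],   # documents on "segment" 1
    ["the cat ran"],                  # documents on "segment" 2
]

def count_partition(docs):
    # Local work: tokenize and count words in one partition.
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_partition, partitions))

total = sum(partials, Counter())
# total["the"] == 3, total["cat"] == 2
```

Because the merge is a simple sum of counters, no worker ever needs to see another worker's documents, which is exactly what makes the task data parallel.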
Structure of input table for PL/R function
• Topology: hubs connected to multiple terminal points
• Using historical readings, solve a linear program to establish baseline behavior, for example number of shipments
• Detecting anomalies within sub-networks on future observations

Column             Description
Network ID         ID of the network. 300K in total.
Topology           Array of integers defining the topology tree.
Network Readings   Array of readings from network terminal points over (say) a week.

(Diagram: example topology with a hub and terminal points A, B, C, D supplying terminal readings.)
Vivek Ramamurthy
Performance Analysis
(Plot: execution time in seconds vs. number of networks in thousands, showing near-linear growth.)
Vivek Ramamurthy

Number of networks   Time/network (ms)   Total time (seconds)
500                  6.604               3.30
1000                 3.637               3.64
5000                 2.822               14.11
10,000               2.356               23.56
50,000               2.160               108.02
100,000              2.142               214.20
150,000              2.162               324.29
200,000              2.142               428.48
250,000              2.138               534.69
300,000              2.132               639.85
Performance Analysis

R package used             optim       quadprog    Rsymphony   Rglpk
Single network in R        ~60 s       6.3 s       0.145 s     0.181 s
300K networks in PL/R      ~84 hrs     5.87 hrs    10.7 min    14.6 min
Time per network in PL/R   1005.2 ms   70.44 ms    2.13 ms     2.92 ms

Vivek Ramamurthy
COIN-OR: Computational Infrastructure for Operations Research
– Libraries for linear and non-linear programming, integer programming
– SYMPHONY: callable library in COIN-OR for solving mixed integer linear programs
http://www.coin-or.org/

GLPK: GNU Linear Programming Kit
– Used for large-scale LPs, MIPs, and related problems
http://www.gnu.org/software/glpk/

Vivek Ramamurthy
Natural Language Processing

Data sources:
• Text sources: documents, books, emails
• Speech: phone logs, conversations

NLP processing pipeline (common tasks/tools):
• Sentence detection
• Tokenization
• Morphological stemming
• Stop word removal
• Word-sense disambiguation
• Part-of-speech tagging
• Syntactic parsing
• Semantic role labeling
• Entity recognition
• Reference resolution
• Event processing
• …

Applications:
• Word clouds
• Topic modeling
• Sentiment analysis
• Machine translation
• Document classification
• Document summarization
• Language generation
• Search
• Question answering
• Information extraction

Niels Kasch
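The first few pipeline stages can be sketched in a few lines of plain Python (a toy tokenizer, stop list, and suffix-stripping "stemmer"; real pipelines would use NLTK, OpenNLP, or GPText):

```python
# Toy tokenization -> stop word removal -> crude suffix stemming.
# Illustrates the pipeline stages only; use NLTK/OpenNLP/GPText in practice.
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are"}

def tokenize(text):
    # Lowercase and pull out word-like tokens.
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    # Naive suffix stripping, nothing like a real Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The drills are penetrating the layered rocks"
tokens = [stem(t) for t in remove_stop_words(tokenize(text))]
# → ['drill', 'penetrat', 'layer', 'rock']
```

Each stage is a pure function over tokens, which is why the whole pipeline parallelizes per document in the data-parallel fashion described earlier.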
Open source tools for common NLP tasks

WORD CLOUDS
Relevant NLP tasks: tokenization, stop word removal, stemming/lemmatization
Open source software: GPText, Apache UIMA, OpenNLP (Java), NLTK (Python), WordNet, PyTagCloud

TOPIC MODELING / TEXT CLASSIFICATION
Relevant NLP tasks: language detection, tokenization, stop word removal, stemming/lemmatization
Open source software: MADlib (PLDA), gensim (LSA & LDA package for Python), https://code.google.com/p/language-detection/

INFORMATION EXTRACTION
Relevant NLP tasks: language detection, sentence detection, tokenization, syntactic parsing, entity extraction, relationship extraction
Open source software: GPText and MADlib, OpenNLP, NLTK, Stanford CoreNLP (incl. POS tagger, NER, parser, etc.)

Niels Kasch
Topic Analysis with MADlib pLDA
(Pipeline: Social Media → natural language processing with GPText (tokenizer; stemming and frequency filtering; filtering relevant content; aligning data) → dataset prepared for topic modeling)
Srivatsan Ramanujam
The MADlib topic model is then applied to the prepared dataset, producing topic compositions, topic clouds, and a topic graph.
Is there more? What’s next?
blog.gopivotal.com/tag/data-science-tech
blog.gopivotal.com/tag/data-science
BUILT FOR THE SPEED OF BUSINESS