© 2012 IBM Corporation
Revolution Confidential
Revolution R Enterprise for IBM Netezza
IBM Netezza with Revolution Analytics
High-performance, in-database analytics platform for Big Data
– Massively parallel processing delivers 10-100x performance
– Run analytics in-database and eliminate data movement
– Scalable architecture fosters experimentation
Innovation with Advanced Analytics
– Analytic modeling with most current statistical methods and 2,500+ open source packages
Enterprise ready advanced analytics software, services & support
– Security, IDE, training, professional services
– Web Services stack enables integration with front-end presentation layer
March 1, 2012
Revolution Analytics
What is R?
Data analysis software
A programming language
– Development platform designed by and for statisticians
– Object-oriented: vector, matrix, model, …
– Built-in libraries of algorithms
An environment
– Huge library of algorithms for data access, data manipulation, analysis and graphics
An open-source software project
– Free, open, and active
A community
– Thousands of contributors, 2 million users
– Resources and help in every domain
Download the White Paper
R is Hot: bit.ly/r-is-hot
The professor who invented analytic software for the experts now wants to take it to the masses
Most advanced statistical analysis software available
Half the cost of commercial alternatives
2M+ Users
2,500+ Applications
Statistics
Predictive Analytics
Data Mining
Visualization
Finance
Life Sciences
Manufacturing
Retail
Telecom
Social Media
Government
Power
Productivity
Enterprise Readiness
Revolution R Enterprise has the Open-Source R Engine at the core
2,500 community packages and growing exponentially
R Engine Language Libraries
Open Source R Packages
Technical Support
Web Services API
Big Data Analysis
Revolution Productivity Environment
Build Assurance
Parallel Tools
Multi-Threaded Math Libraries
Technology Partners
Working with Revolution R Enterprise for IBM Netezza
Revolution R Enterprise for IBM Netezza inside the IBM Netezza Architecture
IBM Netezza Analytics
In-Database Paradigms for using R
In-database Scoring
– Family of apply functions which score analytic models by using data parallelism
– Underlying truism is that there is a fact that can be applied across all data
Big Data Analytics
– Family of parallelized, in-database analytics that have R wrappers and work on entire data set
– Underlying truism exists across all data
Grouped by Row (tapply)
– Data and Task Parallelism
• Data flow technique to apply analytics to naturally occurring groups of data using non-parallelized analytics
– Underlying relationship in data is by a group
Examples
– Customer lifetime value
– Credit score
– Affinity
– Good stock/bad stock
Big data analytics
– Clustering of all data to determine groupings
– Models that are applied across a whole data set – decision trees
– Data transformation – variable selection, correlation
Grouped by row
– Forecasting – by store, stock symbol, etc.
– Build model for each customer, product, etc.
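The grouped-by-row paradigm follows the pattern of base R's tapply. A minimal local sketch, assuming the built-in iris data set as a stand-in for a database table; it uses only base R and does not touch Netezza, where the same per-group work would be pushed to the appliance:

```r
# Local, base-R illustration of the grouped (tapply) paradigm.
# Assumption: the built-in iris data stands in for a database table.
data(iris)

# A non-parallelized analytic applied independently to each group.
max_sepal_length <- tapply(iris$Sepal.Length, iris$Species, max)
print(max_sepal_length)

# "Build a model for each group": one linear fit per species.
fits <- lapply(split(iris, iris$Species),
               function(d) lm(Sepal.Length ~ Petal.Length, data = d))
slopes <- sapply(fits, function(m) coef(m)[["Petal.Length"]])
print(slopes)
```

Because each group is processed independently, the groups can in principle be handled on different nodes, which is the data and task parallelism the slide refers to.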
Access In-Database Language Support from R
SQL, Java, Python, C, Fortran, C++
Open Source R Package Support
Vertical
• Econometrics
• Experimental Design
• Computational Physics
• Clinical Trials
• Environmetrics
• Finance
• Genetics
• Medical Imaging
• Pharmacokinetics
• Phylogenetics
• Psychometrics
• Social Sciences
Horizontal
• Bayesian
• Cluster
• Distributions
• Graphics
• Graphical Models
• Machine Learning
• Multivariate
• Natural Language Processing
• Optimization
• Robust Statistical Metrics
• Spatial
• Survival Analysis
• Time Series
2500+ community packages
Using Revolution R Enterprise with IBM Netezza
R Packages integrate and push analytics processing in-database
[Architecture diagram. Components shown: Business Intelligence, Excel or Third-Party Application; HTTP; Revolution R Enterprise - Workstation; Revolution R Enterprise - Server; RevoDeployR Server Web Services Interface for R; RODBC & nzODBC; IBM Netezza Analytics on the Host and on each S-Blade]
Deploying Revolution R Enterprise to IBM Netezza
• Remote terminal connection to Host
• Create your R Script
• Compile and Register your R Script as an AE (UDAP)
• Execute SQL that will invoke the registered AE
• Go back to the Revolution R Client to retrieve results and continue additional analysis
[Diagram: IBM Netezza Analytics on the Host and on each S-Blade]
Revolution R Enterprise Client Configuration
Revolution R Enterprise
– Productivity Environment
Netezza ODBC Drivers
‘nz’ R Packages
– nzA, nzR, nzMatrix
R Package Dependencies
– RODBC
– caTools
– tree
– bitops
– e1071
– rgl
– ca
– MASS
– XML
IBM Netezza In-Database Analytics from Revolution R
nzR Package
– Encapsulate database and expose “R”-like constructs
– R data.frame = database table
– Apply an R function to a row of data or grouped rows of data
nzA Package
– Entry point to the nzAnalytics
– Explicitly parallelized algorithms that run in database
nzMatrix Package
– Encapsulation of Matrices and operations in Database
– nz.matrix construct in R to access matrices in the database
– R operations on nz.matrix translate to matrix stored procedure operations
nzR Package
Basic Functions
Database Connection: nzConnect, nzConnectDSN
SQL Execution: nzQuery, nzScalarQuery, nzDeleteTable
Data Management: as.nz.data.frame, nz.data.frame
Apply an R function: nzApply, nzTApply, nzGroupedApply
R Package Management: nzInstallPackages, nzIsPackageInstalled
Sample Code
# load packages
library(nzr)
# connect to a database via ODBC
nzConnect("admin", "xyz", "127.0.0.1", "iclasstest")
# load the iris table
nzdf <- nz.data.frame("iris")
# run an nzTApply against the nz data frame
fun <- function(x) max(x[,1])
nzTApply(nzdf, nzdf[,5], fun)
nzA Package
Data Manipulation
Moments: nz.moments
Quantiles: nz.quantile, nz.quartile
Outlier Detection: nz.outliers
Frequency Table: nz.bitable
Histogram: nz.hist
Pearson's Correlation: nz.corr
Spearman's Correlation: nz.spearman.corr, nz.spearman.corr.s
Covariance: nz.cov, nz.cov.matrix
Mutual Information: nz.mutualinfo
Chi-Square Test: nzChisq.test, nz.chisq.test
t-Test: t.ls.test, t.me.test, t.pmd.test, t.umd.test
Mann-Whitney-Wilcoxon Test: nz.mww.test
Wilcoxon Test: nz.wilcoxon.test
Canonical Correlation: nz.canonical.corr
One-Way ANOVA: nzAnova, nz.anova.CRD.test, nz.anova.RBD.test
Principal Component Analysis: nzPCA
Tree-Shaped Bayesian Networks: nz.TBNet Apply, nz.TBNet Grow, nz.BigBNControl, nz.TBNet1g2p, nz.TBNet1g, nz.TBNet2g
nzA Package
Data Transformations
Discretization: nz.efdisc, nz.emdisc, nz.ewdisc
Standardization and Normalization: nz.std.norm
Data Imputation: nz.impute.data
Model Diagnostics
Misclassification Error: nz.cerror
Confusion Matrix: nz.acc, nz.CMATRIX STATS
Mean Absolute Error: nz.mae
Mean Square Error: nz.mse
Relative Absolute Error: nz.rae
Percentage Split: nz.percentage.split
Cross-Validation: nz.cross.validation
nzA Package
Classification
Naive Bayes: nzNaiveBayes, nz.naivebayes, nz.predict.naivebayes
Decision Trees: nzDecTree, nz.dectree, nz.grow.dectree, nz.print.dectree, nz.prune.dectree, nz.predict.dectree
Nearest Neighbors: nz.knn
Regression
Linear Regression: nzLm
Regression Trees: nzRegTree, nz.regtree, nz.grow.regtree, nz.print.regtree, nz.predict.regtree
Clustering
K-Means Clustering: nzKMeans, nz.kmeans, nz.predict.kmeans
Divisive Clustering: nz.divcluster, nz.predict.divcluster
Associative Rule Mining
FP-Growth: nz.fpgrowth, nz.prepare.fpgrowth
nzMatrix Package
Data Manipulation
Coerce or point to a nz.matrix: as.nz.matrix, as.nz.matrix.matrix, nz.matrix
Combine Matrices: nzCBind, nzRBind
Create Matrices From Tables: nzCreateMatrixFromTable, nzCreateTableFromMatrix
Create Special Matrices: nzIdentityMatrix, nzNormalMatrix, nzOnesMatrix, nzRandomMatrix, nzVecToDiag
Decomposition: nzSVD, svd, nzEigen
Delete Matrices: nzDeleteMatrix, nzDeleteMatrixByName
Dimensions: dim, NCOL, ncol, NROW, nrow
Mathematical Functions: abs, add, aubtr, ceiling, div, exp, floor, ln, log10, mod, mult, nzPowerMatrix, pow, rounding, sqrt, trunc
Matrix Engine Initialization: nzMatrixEngineInitialization
Matrix Info: is.nz.matrix, isSparse, nzExistMatrix, nzExistMatrixByName, nzGetValidMatrixName
Operators: *, +, -, <, ==, >, nzKronecker, nzPMax, nzPMin, nzSetValue, [, scale, t
Printing Matrices: print.nz.matrix
Solve: nzInv, nzSolve, nzSolveLLS
Sparse Matrices: isSparse, nzSparse2matrix
Summaries: nzAll, nzAny, nzMax, nzMin, nzSsq, nzSum, nzTr
Demonstration: Using Revolution R with IBM Netezza
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise
Presented by:
Derek M Norton, Senior Sales Engineer
Use Case – Credit Risk
We have a dataset of individuals and their credit risk, stored on the Netezza appliance.
The goal is to model whether someone is “approvable” for a loan.
This use case will follow a modeling process (though condensed) from start to finish.
I will discuss each of the parts, and at the end there will be a demo of the code.
Modeling Exercise
1. Learning more about the data
2. Prepare the data for modeling
3. Fit models to the data
4. Model Performance
1. Learning more about the data
Connect to the IBM Netezza appliance
Summarize the data
Visualize the data
[Figure: histogram titled “Continuous Variable” (x-axis: x, y-axis: Frequency) and bar chart titled “Discrete Variable” (categories: High School Diploma, Bachelors Degree, Masters Degree, Professional Degree, PhD)]
2. Prepare the data for modeling
Split the data into 70/30 Training/Test sets
Transform some variables
Discretize numeric variables for later use
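A rough local sketch of the split and discretization steps in base R. The credit data set is not shown in the slides, so the built-in iris data is assumed as a stand-in; the variable names and the choice of three equal-width bins are illustrative only. The nzA function list includes nz.percentage.split and nz.ewdisc for in-database work, but their signatures are not given here, so they are not used:

```r
# Base-R sketch of a 70/30 split and equal-width discretization.
# Assumption: iris stands in for the credit data, which is not shown.
set.seed(42)                                  # reproducible split
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Three equal-width bins, with break points taken from the training set.
rng <- range(train$Sepal.Length)
breaks <- seq(rng[1], rng[2], length.out = 4)
breaks[1] <- -Inf                             # catch values below the training range
breaks[4] <- Inf                              # catch values above the training range
bin_labels <- c("low", "mid", "high")

train$Sepal.Length.bin <- cut(train$Sepal.Length, breaks = breaks, labels = bin_labels)
test$Sepal.Length.bin  <- cut(test$Sepal.Length,  breaks = breaks, labels = bin_labels)

print(table(train$Sepal.Length.bin))
```

Taking the break points from the training set and reusing them on the test set keeps the two sets on the same bins without looking at test data.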
3. Fit models to the data
Build two different models to predict if an individual is “approvable”
– Decision Tree
– Naïve Bayes
4. Model Performance
Examine confusion matrices to determine:
– Training performance
– Test performance
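For readers unfamiliar with confusion matrices, a small base-R sketch follows. The label vectors are invented for illustration and are not results from the demo; in practice the same computation would be run once on training predictions and once on test predictions:

```r
# Base-R sketch of a confusion matrix and misclassification error.
# Assumption: these label vectors are made up for illustration only.
actual    <- factor(c("yes", "yes", "no", "no", "yes", "no", "no", "yes"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no",  "no", "no", "yes", "yes", "no", "yes"),
                    levels = c("no", "yes"))

# Rows are the true classes, columns are the predicted classes.
conf_mat <- table(actual = actual, predicted = predicted)
print(conf_mat)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
error    <- 1 - accuracy
```

A large gap between training and test error computed this way is the usual sign of overfitting.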
Demo
Summary
Familiar environment for R Developers
– World-class productivity tools
– Enterprise class service, support and integration
Execution of analytics in-database
– Analytic computing distributed across Netezza nodes and run in a massively parallel manner
– Each Netezza node gets a data slice and analytics are pushed down from the Host to the individual nodes
Capabilities
– R Code executed on Netezza nodes in row-by-row fashion or on groups of rows
– Enables access to explicitly parallelized algorithms running on entire data set
– Large-scale parallel matrix operations on database tables
Performance
– 10-100x Performance improvements
Contact Us
Derek Norton, Solutions Executive, Revolution Analytics, [email protected]
www.revolutionanalytics.com, +1 (650) 646 9545, Twitter: @RevolutionR
Bill Zanine, Business Solutions Executive, Analytics Solutions, IBM, [email protected]