big data analytics on teradata with revolution r enterprise bill jacobs
Post on 08-May-2015
1.454 Views
Preview:
DESCRIPTION
TRANSCRIPT
1877 Big Data Analytics on Teradata: An Introduction toRevolution R Enterprise
Bill Jacobs
Dir., Product Marketing, Revolution Analytics
Demystifying R
What is R
Why is it so
popular?
Is it only open
source?
3
Confidential to Revolution Analytics and shared with Siemens under the NDA dated 27/9/2013
4
THE PERFECT STORM
+ Computing Power + Data + Pace of Business+ Customer Expectations
Our view: Big Data meets Big Math = New Business Outcomes
+Data Science+Computer Science +Management Science
Better Business Decisions
New Business Outcomes
Confidential to Revolution Analytics
Big Analytics Delivers Value from Big Data
5
Volume Variety Velocity
The three V’s of Big Data Big Analytics:
Maximizing Value,accommodating data Volatility,while assuring Veracity of insights
The three Vs of Big Data:
6
WELCOME & INTRODUCTIONS
R Open Source- Language, Community, Collaboration
- Robert Gentleman & Ross Ihaka, 1993
- Version 1.0 released 2000
- 2.5 Million Global Users
- Over 4,800 add-on “Packages”
- Why R?
R in Universities = New Talent
Emerging Modeling/Visualization
Lower Cost Alternative
Open Source = Flexible & Innovative
Access to Free Packages
Confidential to Revolution Analytics and shared with Siemens under the NDA dated 27/9/2013
Source: http://r4stats.com/popularity 7
R is Exploding in Popularity & Functionality
Web Site PopularityNumber of links to main web site
4,000
2,000
1,050
900
600
R
SAS
SPSS
S-Plus
Stata
Scholarly ActivityGoogle Scholar hits (’05-’09 CAGR)
R 46%
SAS -11%
SPSS -27%
S-Plus 0%
Stata 10%
1995 2000 2005 2010
0
1,000
2,000
3,000
4,000
Internet DiscussionMean monthly traffic on email discussion list
R
SAS
Stata
SPSS
S-Plus
Package GrowthNumber of R packages listed on CRAN
0
500
1000
1500
2000
2500
R is exploding in popularity & functionality
“A key benefit of R is that it provides near-instant availability of new and
experimental methods created by its user base — without waiting for the
development/release cycle of commercial software. SAS
recognizes the value of R to our customer base…”
Product Marketing Manager SAS Institute, Inc
“I’ve been astonished by the rate at which R has been adopted. Four
years ago, everyone in my economics department [at the
University of Chicago] was using Stata; now, as far as I can tell, R is
the standard tool, and students learn it first.”
Deputy Editor for New Products at Forbes
R Usage GrowthRexer Data Miner Survey, 2007-2013
70% of data miners report using R
24% use R as primary tool
Source: www.rexeranalytics.com
R Is The Most Commonly Used Primarly Analytics Tool
70% of data miners report using R
24% use R as primary tool
Source: www.rexeranalytics.com
Source: www.rexeranalytics.com
10
Example of advanced visualization with R
Facebook Network Graphic
11
R Community, collaboration and breadth: CRAN task views (sub set of 4800+ packages)
Source: http://www.maths.lancs.ac.uk/~rowlings/R/TaskViews/
Confidential to Revolution Analytics and shared with Siemens under the NDA dated 27/9/2013
12
Key Big Data Challenge: The Analytics Talent Pool
13
2 Million R Users
The Analytics Talent Pool With R
14
Big Data In-memory bound Hybrid memory & disk scalability
Operates on bigger volumes & factors
Speed of Analysis
Single threaded Parallel threading Shrinks analysis time
Enterprise Readiness
Community support
Commercial support Delivers full service production support
Analytic Breadth & Depth
5000+ innovative analytic packages
Leverage open source packages plus Big Data ready packages
Supercharges R
Commercial Viability
Risk of deployment of open source
Commercial license Eliminate risk with open source
R is open source and drives analytic innovation but….has some limitations for Enterprises
250Customers
Our History & Our Future
500Customers
2007 2013 2015 2017
CompanyFounding
Chapter 1Capture Mindshare
Chapter 2Mobilize withMarket Focus
Chapter 3ScalableGrowth
Revolution R EnterpriseV1 through V6.1
Revolution R EnterpriseV6.2 through V9
Revolution R EnterpriseV10 through v11
1000Customers
Relocate HQ to Palo Alto
NA OfficesNYC
Dallas
Company Confidential – Do not distribute 15
Digital Media & Retail
200+ Customer Stories
16
Finance & Insurance Healthcare & Life Sciences
Manufacturing & High TechAcademic & Gov’t
Revolution Confidential
Revolution Analytics - Overview
17
We are the only provider of a commercial analytics platform based on the open source R statistical computing language.
Power
Productivity
Enterprise Readiness
Stable, scalable
multi-platform with
world-wide support
Easier to build and deploy
analytic applications
Professional services
enablement
Distributed, high performance
analytical algorithms
World Wide Support Teams• Standard and Premium Programs• Technical Account Managers• Customer Success Managers
Professional Services• Architecture planning• Systems Integration• Advanced analytic applications• Full life cycle projects
Customers Revolutionize their Business
Power
“…we saw about a 4x performance improvement on 50 million records. It works brilliantly.” - CEO, John Wallace, DataSong
4X performance 50M records scored daily
Scalability
“We’ve been able to scale our solution to a problem that’s so big that most companies could not address it…..” - SVP Analytics, Kevin Lyons, eXelate
TB’s data from 200+ data sources10’s thousands attributes100’s millions of scores daily
2X data 2X attributes no impact on performance
Performance
“We need a high-
performance analytics …
we can now identify
opportunities for our clients
that would otherwise be
lost.”
- Chief Analytics Officer,
Leon Zemel, [x+1]
19
Revolution R Enterprise
What is Revolution R Enterprise?
How does Revolution R Enterprise work with Teradata Database?
Revolution R Enterprise
High Performance, Scalable Analytics Portable Across Enterprise Platforms Easier to Build & Deploy Analytics
is….the only big data big analytics platform based on open source R,the defacto statistical computing language for modern analytics
21
How is RRE Used?
Discovering Patterns with Big Data
Building Models Efficiently
Flexibly Deploying Models to Consumers
Customer segmentation
Market basket analysis
Social networking analysis
Fraud detection Marketing attribution Sentiment analysis …and much more
Credit risk Customer churn Propensity to buy Market risk Operational risk …and much more
Customer lifetime value Pricing optimization Recommendation
engines …and much more
22
Introducing Revolution R Enterprise (RRE)The Big Data Big Analytics Platform
R+
CR
AN
Rev
oR
DistributedR
DevelopR DeployR
ScaleR
ConnectR
Big Data Big Analytics Ready
– Enterprise readiness
– High performance analytics
– Multi-platform architecture
– Data source integration
– Development tools
– Deployment tools
23
The Platform Step by Step:R Capabilities
R+
CR
AN
DistributedR
ScaleR
ConnectR
R+CRAN• Open source R interpreter
• UPDATED R 3.0.2• Freely-available R algorithms• Algorithms callable by RevoR• Embeddable in R scripts• 100% Compatible with existing
R scripts, functions and packages
RevoR• Performance enhanced R interpreter• Based on open source R• Adds high-performance math
Available On:• PlatformTM LSFTM Linux®
• Microsoft® HPC Clusters• Microsoft Azure Burst• Windows® & Linux Servers• Windows & Linux Workstations• Teradata® Database• IBM® Netezza®
• IBM BigInsightsTM
• Cloudera Hadoop®
• Hortonworks Hadoop• Intel® Hadoop
Rev
oR
DevelopR
DeployR
24
25
Big Data Speed @ Scale with Revolution R Enterprise (RRE)
Fast Math Libraries
Parallelized Algorithms
In-Database Execution
Multi-Threaded Execution
Multi-Core Processing
In-Hadoop Execution
Memory Management
Parallelized User Code
First, we enhance and accelerate the Open Source
R interpreter.
26
Open Source R Performance: Multi-threaded Math
OpenSource R
Revolution R Enterprise
Computation (4-core laptop) Open Source R Revolution R Speedup
Linear Algebra1
Matrix Multiply 176 sec 9.3 sec 18x
Cholesky Factorization 25.5 sec 1.3 sec 19x
Linear Discriminant Analysis 189 sec 74 sec 3x
General R Benchmarks2
R Benchmarks (Matrix Functions) 22 sec 3.5 sec 5x
R Benchmarks (Program Control) 5.6 sec 5.4 sec Not appreciable
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php2. http://r.research.att.com/benchmarks/
Customers report 5-50x performance improvements
compared to Open Source R — without changing any code
Rev
oR
DevelopR
DeployR
R+
CR
AN
DistributedR
ScaleR
ConnectR
The Platform Step by Step:Parallelization & Data Sourcing ConnectR
• High-speed & direct connectors
Available for:• High-performance XDF• SAS, SPSS, delimited & fixed format
text data files• Hadoop HDFS (text & XDF)• Teradata Database & Aster• EDWs and ADWs• ODBC
ScaleR• Ready-to-Use high-performance
big data big analytics • Fully-parallelized analytics• Data prep & data distillation• Descriptive statistics & statistical
tests• Correlation & covariance matrices• Predictive Models – linear, logistic,
GLM• Machine learning• Monte Carlo simulation• NEW Tools for distributing
customized algorithms across nodes
DistributedR• Distributed computing framework• Delivers portability across platforms
Available on:• Windows Servers• Red Hat and NEW SuSE Linux Servers• IBM Platform LSF Linux• Microsoft HPC Clusters• Microsoft Azure Burst• NEW Teradata Database• NEW Cloudera Hadoop• NEW Hortonworks Hadoop
27
28
Big Data Speed @ Scale with Revolution R Enterprise (RRE)
Fast Math Libraries
Parallelized Algorithms
In-Database Execution
Multi-Threaded Execution
Multi-Core Processing
In-Hadoop Execution
Memory Management
Parallelized User Code
Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.
Revolution R EnterprisePowering Next Generation Analytics
COMBINE INTERMEDIATE RESULTS
29
30
SAS HPA Speed comparison* Logistic Regression
Rows of data 1 billion 1 billion
Parameters “just a few” 7
Time 80 seconds 44 seconds
Data location In memory On disk
Nodes 32 5
Cores 384 20
RAM 1,536 GB 80 GB
Revolution R is faster on the same amount of data, despite using approximately a 20 th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
Double
45%
1/6th
5%
5%
Revolution R Enterprise Delivers Performance at 2% of the Cost
Analytics Layer: High Performance Big Data Analytics with ScaleR
R Data Step DescriptiveStatistics
StatisticalTests
Sampling
PredictiveModeling
DataVisualization
MachineLearning
Simulation
31
Company Confidential – Do not distribute
32
ScaleR: Fast Parallel External Memory Algorithms
Data import – Delimited, Fixed, SAS, SPSS, OBDC
Variable creation & transformation
Recode variables Factor variables Missing value handling Sort Merge Split Aggregate by category (means,
sums) Use any of the functionality of
the R language to transform and clean data row by row!
Min / Max Mean Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product
matrix for set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data (standard
tables & long form) Marginal Summaries of Cross
Tabulations
Chi Square Test t-Test F-Test Plus 100’s of other tests
available in R!
Data Prep, Distillation & Descriptive Analytics
Subsample (observations & variables)
Random Sampling High quality, fast, parallel
random number generators
R Data Step Statistical Tests
Sampling
Descriptive Statistics
33
ScaleR: Fast Parallel External Memory Algorithms
Covariance, Correlation, Sums of Squares (cross product matrix for set variables) matrices
Multiple Linear Regression Generalized Linear Models (GLM) - All
exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.
Logistic Regression Classification & Regression Trees Decision Forests Predictions/scoring for models Residuals for all models
Histogram Line Plot Lorenz Curve ROC Curves (actual data and predicted
values) Plus numerous tools in R and ScaleR
to generate big data visualizations
K-Means
Statistical Modeling
Decision Trees Decision Forests
Predictive Models Cluster AnalysisData Visualization
Classification
Machine Learning
Simulation
High quality, fast, parallel random number generators
Use the rich functionality of R for simulations
The Power of Revolution R EnterprisePerformance & Scalability
R + CRAN
Fast Math Libraries
Memory Management
Multi-Threaded Execution
Grid Processing
Parallelized Algorithms
Parallelized User Code
In-Database Execution
Open Source Leverage latest innovation
In-Hadoop Execution
Va l ue
RevoR 3-50X faster
DistributedR Effective memory utilization
DistributedR Powerful divide & conquer
DistributedR Maximizes computation
ScaleR Labor saving power
ScaleR Leverage CRAN
ScaleR Moves computation to data
ScaleR Moves computation to data
34
35
Why Teradata And Revolution R Enterprise?
Teradata User Demand
Data Movement Penalty Growing
New Analytics Requiring MPP Approach
R Popularity
Open Source Limitations
Arrival of Teradata v14.10
36
Revolution Analytics coupled with the Teradata Unified Data Architecture accelerates
big data analytics using the widely-accepted R language.
Available Today: Scalable R analytics on servers
connected to Teradata High speed, parallel data transfer, 5x
faster than RODBC Integrated parallel analytics solution
Company Confidential
High-SpeedTPT Connector
+
Revolution R Enterprise 6.2
Upcoming Capabilities (4Q13) Parallel R in-database for big data
analytics on Teradata R programmers can immediately build
parallel R models completely in R Revolution parallel in-database
algorithms exclusively available on Teradata Teradata Version 14.0
Teradata Version 14.10
+Revolution R Enterprise V7
37
Introducing Revolution R Enterprise Version 7 on Teradata Database
New Teradata Table Operators
New Parallelized Algorithms
In-Database Execution of Parallelized Algorithms
Executes R Scripts From R Workstations or Servers
Provides Orders of Magnitude Performance Gains
Supports Multiple Platforms in UDA
Available Late 2013
38
Revolution Analytics in the UDA
UNIFIED DATA ARCHITECTUREWith Revolution R Enterprise
RODBC
Seamless use of R analytics across the Teradata UDA
39
HOW DOES IT WORK?
Transparent Parallelization of Analytical, Predictive Modeling and Machine Learning in Teradata
40
Understanding R’s Compute Workload
Compute Burden from Script or Command
Computational Workload Breakdown
Compute Burden from Algorithmic Computations
Algorithms99.xxx%
R Script < 1%
41
ScaleR PEMAs: High Performance Analytical Algorithms
Users Script Calls ScaleR PEMA
– No Unique Code or Setup for Parallelism
– ScaleR Algorithms are “just another R package”
– Using PEMAs is Transparent, Automatic, Fast and Scales
Linearly
PEMAs Transparently Parallelize Algorithm Execution
– Parallelized Versions of Statistics, Predictive Modeling and
Machine Learning Algorithms
– PEMAs Transparently Distribute Computations Across AMPs
– Results are Consolidated Into A Single Result Set
– Provides Write Once Deploy Anywhere (WODA) Portability
42
Transparent to the Script
Transparent Distributed Computing with RRE ScaleR
In Revolution R Enterprise:
Script Calls ScaleR PEMA Algorithm Executes
Algorithm Returns to Script Script Continues Execution
Algorithm Starts A Master Process Master Identifies Environment
Threading?
Cores?
Chips?
Distributed Nodes?
Master Initializes Algorithm Prepares Instructions for Nodes
Master Executes Table Operators In Each VAMP
VAMPs process each data segment
Table Operator runs in each VAMP
Table Operator returns Intermediate Result Object (IRO) to master process
Master Process Combines IROsReturns Consolidated Answer to Script
43
ScaleR PEMAs on Teradata:Transparent Distribution of R Analytics
For Each Call to a ScaleR Algorithm:– One Request
– Many Subtasks
– One Answer
AMPs
CorporateApplications
Extended Stored Procedure
Desktops & Servers
Teradata Database
+Revolution
R Enterprise
ODBC
Revolution R Enterprise
Revolution R Enterprise
Table Operators
Revolution R Enterprise EcosystemPower of Integration
46
Deployment / Consumption
Data / Infrastructure
Advanced Analytics
ETL
SI / Service
Corios
MSP / DSP
Rev
oR
R+
CR
AN
DistributedR
ScaleR
ConnectR
DeployR• Web services software
development kit• Integrates R Into application
infrastructures
Capabilities:• Invokes R Scripts from
web services calls• RESTful interface for
easy integration• Works with leading desktop
& BI tools
DevelopR• Freely-available R algorithms• Callable by RevoR• Embeddable in R scripts
Available on:• Can be called by RevoR• Can be run singe-node
using RevoR• Analyze large data using
RDataStep package• Run on multiple nodes using
rxEXEC package
The Platform Step by Step:Tools & Deployment
47
DevelopR DeployR
48
DevelopR Integrated Development Environment
Script with type ahead and code snippets Solutions window
for organizing code and data
Packages installed and
loaded
Objects loaded in the R
Environment
Object details
Sophisticated debugging with breakpoints ,
variable values etc.
http://www.revolutionanalytics.com/demos/revolution-productivity-environment/demo.htm
Seamless Bring the power of R to any web enabled application
Simple Leverage common APIs including JS, Java, .NET
Scalable Robustly scale user and compute workloads
Secure Manage enterprise security with LDAP & SSO
Data Analysis
Business Intelligence
Mobile Web Apps
Cloud / SaaS
R / Statistical Modeling Expert
DeployR
DeploymentExpert
49
On-demand sales forecasting
Real-time social media sentiment
analysis
Create Custom, On-Demand Analytical AppsSome Examples:
Leveraging the power of R from Microsoft tools
50
Alteryx and Revolution Analytics
Delivering Enterprise-Scale Predictive Analytics to Line
of Business Analysts
Enabling a Broader Audience to Harness the
Universe of R
Empowering Analysts with Easy-to-Use Predictive Tools combined with the
Leading R Platform
Making Predictive Analytics More Accessible and Scalable
51
52
Summary. R is Hot.
– Most Broadly Used Analytical Language
– Its Popularity Addresses Critical Talent Gap
– Vast Functionality Via CRAN
– R Needs a Platform For Big Data Big Analytics Revolution Provides Enterprise-Capable Platforms for R.
– High Performance.
– Scalable via Transparent Distributed Execution
– Portable – Write Once Deploy Anywhere - WODA
– Commercial Support & Services Cut Project Risks Teradata + Revolution Provide a Robust Solution
– Teradata provides stable, high-performane big data environment
– Revolution provides speed, scale, portability and stability for the enterprise
53
www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR
The leading commercial provider of software and support for the popular open source R statistics language.
Next steps?
top related