Big Data Analytics with R
Great Wide Open, Day 1, 11:15 AM, Operations 2 (Big Data)
Derek McCrae Norton, Senior Sales Engineer, Revolution Analytics
April 2, 2014
Agenda
Introduction
Big Data
Analytics
R
Revolution R Enterprise
Synergy
Conclusion
© 2013 Revolution Analytics
Who are you anyway?
Statistician
– My degrees are all in statistics.
Consultant
– My experience has been mostly in Marketing Analytics focusing on Predictive
Analytics.
Sales Engineer
– Still consulting, just with a much heavier emphasis on client interaction.
Founder/Director, Atlanta R Users Group.
– Shameless plug. Please join if interested.
– http://www.meetup.com/R-Users-Atlanta/
Husband, Father, Outdoorsman, Serial Hobbyist, …
Big Data
Big Data and Big Opportunities
“Big data is data that exceeds the processing capability of conventional database systems”
Edd Dumbill, O’Reilly Radar*, Jan 2012
[Chart: Worldwide data created and replicated, in zettabytes. Data labels as extracted: 1 2, 35]
* radar.oreilly.com/2012/01/what-is-big-data.html
What is Big Data?
Big Data is a loosely defined term used to describe
data sets so large and complex that they become
awkward to work with using standard statistical
software.
Snijders, Matzat, & Reips (2012)
Does Big Data Mean Hadoop?
The short answer is no.
The longer answer is maybe.
Hadoop adoption is turning that maybe into a probably.
Analytics
What is Analytics?
Analytics is the combination of mathematical,
statistical, and heuristic techniques to glean useful
insights from data and to implement actions derived
from those insights.
Derek McCrae Norton
Analytics
The current buzzword is “Data Science,” but I don’t really agree with that nomenclature.
– What statistician, analyst, (data scientist) actually
follows the scientific method?
That being said, the current definition of “Data Science”
is a pretty good surrogate for what we are discussing.
Whatever descriptors you use, one thing is clear… You must use
something to help you carry out the actual work.
– R, Python, SAS, etc.
– RDBMS, Hadoop, etc.
What is the R language?
A Platform…
– A Procedural Language for Stats, Math and Data Science
– A Complete Data Visualization Framework
– Provided as Open Source
A Community…
– 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and
Machine Learning Projects
– Active User Groups Across the World
An Ecosystem
– CRAN: 5000+ Freely Available Packages
– Applicable to Big Data if scaled
THE R USER COMMUNITY
A brief history of R
1993: Research project in Auckland, NZ
– Ross Ihaka and Robert Gentleman
1995: Released as open-source software
– Generally compatible with the “S” language
1997: R core group formed
2000: R 1.0.0 released
2004: First international user conference in Vienna
2013: R 3.0.0 released
R is Free Open Source, licensed under GPL (like Linux!)
– Free as in beer
– Free as in freedom
Flexible
Open for integration
– Data (SAS, SPSS, Excel, SQL Server, Oracle, …)
– Systems (applications, webservers, …)
Broad user-base
– De-facto standard for data analysis teaching
R is exploding in popularity & function
[Four chart panels]
Web Site Popularity: number of links to main web site (series: R, SAS, SPSS, S-Plus, Stata)
Scholarly Activity: Google Scholar hits, ’05-’09 CAGR (R 46%, SAS -11%, SPSS -27%, S-Plus 0%, Stata 10%)
Internet Discussion: mean monthly traffic on email discussion list (series: R, SAS, Stata, SPSS, S-Plus)
Package Growth: number of R packages listed on CRAN (4,332 as of Feb 2013)
So why isn’t everyone using R?
“The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.”
Bo Cowgill, Google (at SF R Meetup)
Otherwise R is Great! Right?
Who here has used R?
– Thoughts?
Who has never seen this?
Who here has more than 1 core/processor?
Who has ever used r-help?
– ‘They’ did write documentation that told you that Perl was needed, but ‘they’ can’t read it for you. - Brian D. Ripley, R-help (February 2001)
– This is all documented in TFM. Those who WTFM don’t want to have to WTFM again on the mailing list. RTFM. - Barry Rowlingson, R-help (October 2003)
What is Revolution R Enterprise?
Motivators
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
Commercial Viability | Risk of deployment of open source | Commercial license | Eliminate risk with open source
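The move from "in-memory bound" to "hybrid memory & disk" processing can be illustrated generically: read the data in fixed-size chunks, summarize each chunk, and merge the summaries. The sketch below is a plain Python illustration of that idea and is not RevoScaleR's implementation. The names `chunk_stats`, `merge_stats`, `read_chunks`, and `streaming_mean_var` are hypothetical, and the merge step uses the standard pairwise update for mean and variance.

```python
from typing import Iterable, Iterator, List, Tuple

# A summary is (count, mean, M2), where M2 is the sum of squared deviations.
Stats = Tuple[int, float, float]


def chunk_stats(values: List[float]) -> Stats:
    """Summarize one chunk that is small enough to hold in memory."""
    n = len(values)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return (n, mean, m2)


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Combine two chunk summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def read_chunks(data: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive chunks; stands in for reading blocks from disk."""
    chunk: List[float] = []
    for value in data:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def streaming_mean_var(data: Iterable[float], chunk_size: int = 1000) -> Tuple[float, float]:
    """Return (mean, sample variance) while holding only one chunk at a time."""
    total: Stats = (0, 0.0, 0.0)
    for chunk in read_chunks(data, chunk_size):
        total = merge_stats(total, chunk_stats(chunk))
    n, mean, m2 = total
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return mean, variance
```

Because each chunk summary is independent of the others, the same structure also lends itself to the "parallel threading" row: chunks could be summarized by separate workers and merged at the end.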
Introducing Revolution R Enterprise (RRE)
The Big Data Big Analytics Platform
[Diagram: platform components DevelopR, DeployR, ConnectR, ScaleR, DistributedR]
Big Data Big Analytics Ready
– Enterprise readiness
– High performance analytics
– Multi-platform architecture
– Data source integration
– Development tools
– Deployment tools
The Platform Step by Step: R Capabilities
R+CRAN
• Open source R interpreter
• UPDATED R 3.0.2
• Freely-available R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% Compatible with existing R scripts, functions and packages
RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math
Available On:
• Platform™ LSF™ Linux®
• Microsoft® HPC Clusters
• Windows® & Linux Servers
• Windows & Linux Workstations
• IBM® Netezza®
• NEW Cloudera Hadoop®
• NEW Hortonworks Hadoop
• NEW Teradata® Database
• Intel® Hadoop
• IBM BigInsights™
The Platform Step by Step: Parallelization & Data Sourcing
ConnectR
• High-speed & direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC
ScaleR
• Ready-to-Use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Correlation & covariance matrices
• Predictive Models – linear, logistic, GLM
• Machine learning
• Monte Carlo simulation
• NEW Tools for distributing customized algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers portability across platforms
Available on:
• Windows Servers
• Red Hat and NEW SuSE Linux Servers
• IBM Platform LSF Linux
• Microsoft HPC Clusters
• NEW Teradata Database
• NEW Cloudera Hadoop
• NEW Hortonworks Hadoop
A single package (RevoScaleR)
The Platform Step by Step: Tools & Deployment
DeployR
• Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
• Integrates R into application infrastructures
Capabilities:
• Invokes R scripts from web services calls
• RESTful interface for easy integration
• Works with web & mobile apps, leading BI & visualization tools and business rules engines
DevelopR
• Integrated development environment for R
• Visual ‘step-into’ debugger
Available on:
• Windows
Write Once. Deploy Anywhere.
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
[Diagram: DistributedR, ScaleR, ConnectR, DeployR]
In the Cloud: Amazon AWS
Workstations & Servers: Desktop, Server
Clustered Systems: IBM Platform LSF, Microsoft HPC
EDW: Teradata
Hadoop: Hortonworks, Cloudera
Synergy
Put it all together
Talent fresh out of school knows R.
RRE is R plus more.
RRE provides a unified way of carrying out analytics (small or big).
RRE code is portable…
Scale and Portability
Set “compute context” to define hardware (one line of code)
– Native job-scheduler handles distribution, monitoring, failover etc.
Same code runs on other supported architectures
– Just change compute context
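The "compute context" idea, where the analysis code stays the same and one line selects where it runs, can be sketched in a few lines of Python. This is a conceptual illustration only, not the RevoScaleR API. The names `set_compute_context`, `chunk_sum`, and `total`, and the two contexts `"local"` and `"threads"`, are hypothetical. Threads are used here just to show the switch, not to reproduce any timing from the talk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def _run_local(fn: Callable, chunks: Sequence) -> List:
    """Apply fn to each chunk one after another in the current thread."""
    return [fn(chunk) for chunk in chunks]


def _run_threads(fn: Callable, chunks: Sequence) -> List:
    """Apply fn to the chunks using a small pool of worker threads."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fn, chunks))


# Hypothetical registry: each compute context is a strategy for running work.
_CONTEXTS: Dict[str, Callable] = {"local": _run_local, "threads": _run_threads}
_current_context = "local"


def set_compute_context(name: str) -> None:
    """Select where subsequent analyses run (the 'one line of code')."""
    global _current_context
    if name not in _CONTEXTS:
        raise ValueError(f"unknown compute context: {name!r}")
    _current_context = name


def chunk_sum(chunk: Sequence[float]) -> float:
    """Per-chunk work; identical in every context."""
    return sum(chunk)


def total(chunks: Sequence[Sequence[float]]) -> float:
    """Analysis code: it never mentions how or where the chunks are processed."""
    runner = _CONTEXTS[_current_context]
    return sum(runner(chunk_sum, chunks))
```

A usage example under the same assumptions: call `set_compute_context("local")` and then `total(chunks)`, then call `set_compute_context("threads")` and run the identical `total(chunks)` call again. Only the context line changes between the two runs.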
42 seconds instead of 6 minutes on the local machine
References
1. Snijders, C., Matzat, U., & Reips, U.-D. (2012). ‘Big Data’: Big gaps of knowledge in the field of Internet science. International Journal of Internet Science, 7, 1-5. http://www.ijis.net/ijis7_1/ijis7_1_editorial.html
2. Conway, D. The Data Science Venn Diagram.