data analysis in r for beginners
Post on 10-Jan-2017
Data Analysis in R for Beginners
Alton Alexander
Data Science Consultant
Why R?
R is open source, like Python and unlike SAS
Out of the box, R is a single-machine, in-memory statistical computing engine
Download from https://www.r-project.org/
Use an IDE
RStudio https://www.rstudio.com/
Revolution Analytics (MSFT)
Jupyter (IPython)
RStudio
Download
Overview
Essential Learning Resources
A new book for learning R
Q: What have you tried and what works?
Topics
Data ingestion
Manipulation
Summary and exploration
Writing reports
Interactive visualization and dashboarding
Predictive modeling & forecasting
Big data integrations
Demo
Options data
RStudio
Data ingestion
Load data
read.csv()
library(RJDBC)
library(RODBC)
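A minimal sketch of loading a flat file with read.csv(). The file contents, column names, and the variable name options_data are made up for illustration. The database lines are commented out because the DSN and table names are hypothetical.

```r
# Write a small, made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("symbol,strike,price",
             "ABC,100,2.50",
             "ABC,105,1.10",
             "XYZ,50,0.75"), csv_path)

# read.csv() returns a data frame; stringsAsFactors = FALSE keeps text as character.
options_data <- read.csv(csv_path, stringsAsFactors = FALSE)
str(options_data)

# For databases, RODBC (or RJDBC) can return query results as data frames.
# The DSN "my_dsn" and the table "trades" are hypothetical:
# library(RODBC)
# con <- odbcConnect("my_dsn")
# trades <- sqlQuery(con, "SELECT * FROM trades")
# odbcClose(con)
```

Either route ends with an ordinary data frame, so the later steps do not depend on where the data came from.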
Data Structures and Manipulation
Another major reason for using R
Ability to work with data in data frames
Like pandas in Python and data tables in SAS
Reasons for doing data manipulation (munging)
Feature extraction
ETL
Data cleansing
Pivots, stack/unstack, aggregate, groupby, reshape
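The manipulation steps listed above can be sketched with base R alone. The sales data frame and its columns are invented for illustration; packages such as dplyr or reshape2 offer other ways to do the same thing.

```r
# Made-up sales records: one row per deal.
sales <- data.frame(
  region  = c("East", "East", "West", "West", "East", "West"),
  quarter = c("Q1", "Q2", "Q1", "Q2", "Q1", "Q2"),
  revenue = c(100, 120, 80, 95, 50, 5)
)

# Feature extraction: derive a new column from an existing one.
sales$large_deal <- sales$revenue >= 90

# Group-by and aggregate: total revenue per region and quarter.
totals <- aggregate(revenue ~ region + quarter, data = sales, FUN = sum)

# Pivot (long to wide): one row per region, one column per quarter.
wide <- xtabs(revenue ~ region + quarter, data = totals)

print(totals)
print(wide)
```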
Set Theory
SQL joins and their results
merge, sqldf in R
http://www.r-bloggers.com/manipulating-data-frames-using-sqldf-a-brief-overview/
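A small sketch of how SQL-style joins map onto base R's merge(). The two data frames are made up. The sqldf line is shown only as a comment and assumes the sqldf package is installed.

```r
# Two made-up tables that share an "id" key.
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cal"))
orders    <- data.frame(id = c(2, 3, 3, 4), amount = c(25, 10, 40, 99))

# Inner join: only ids present in both tables.
inner <- merge(customers, orders, by = "id")

# Left join: every customer, with NA where there is no order.
left <- merge(customers, orders, by = "id", all.x = TRUE)

# Full outer join: every id from either table.
full <- merge(customers, orders, by = "id", all = TRUE)

print(inner)
print(left)
print(full)

# The same inner join written as SQL (assumes sqldf is installed):
# library(sqldf)
# sqldf("SELECT * FROM customers JOIN orders USING (id)")
```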
Summary and Exploration
Powerful summary functions for programmatically quantifying datasets
Functions include: summary(), hist(), levels(), aggregate()
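A quick tour of these four functions, using the iris dataset that ships with R in place of the demo data. hist() is called with plot = FALSE here so the sketch runs without a graphics device; drop that argument to draw the histogram.

```r
data(iris)

# Per-column summary: quartiles and mean for numbers, counts for factors.
print(summary(iris))

# The distinct categories of a factor column.
species_levels <- levels(iris$Species)
print(species_levels)

# Group-wise summary: mean sepal length per species.
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(means)

# Histogram bin counts without drawing the plot.
h <- hist(iris$Sepal.Length, plot = FALSE)
print(h$counts)
```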
Interactive Visualization and Dashboarding
Shiny from RStudio
Like Tableau
Local and server options
Much more customizable, more coding, no GUI or click-to-edit
But you can bring in powerful libraries to build web apps comparatively fast
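To give a sense of the amount of code involved, here is a minimal Shiny app sketch. It assumes the shiny package is installed and uses R's built-in faithful dataset. The input and output names ("bins", "hist") and the helper build_ui() are arbitrary choices for this example.

```r
# Server logic: redraw the histogram whenever the slider changes.
server <- function(input, output) {
  output$hist <- shiny::renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "Minutes")
  })
}

# User interface: a title, one slider, and one plot.
build_ui <- function() {
  shiny::fluidPage(
    shiny::titlePanel("Old Faithful waiting times"),
    shiny::sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
    shiny::plotOutput("hist")
  )
}

# Launch only in an interactive session with shiny available.
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  shiny::runApp(shiny::shinyApp(ui = build_ui(), server = server))
}
```

Run locally, this opens in a browser. The same pair of ui and server objects can be deployed to a Shiny server.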
Predictive Modeling & Forecasting Examples
Customer segmentation: unsupervised classification
Marketing mix models: explain the coefficients
Attribution modeling: supervised time series of events
Multivariate testing: A/B tests with statistical significance, ANOVA
Lead scoring: P2B models, topic of interest, propensity to buy, expected spend
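As one illustration of the first item, customer segmentation can be sketched with k-means clustering from base R. The data is synthetic, and the column names and the choice of two segments are assumptions made for this example. Real data would call for more care in choosing features and the number of clusters.

```r
set.seed(1)

# Synthetic customers: a low-spend group and a high-spend group.
customers <- data.frame(
  annual_spend = c(rnorm(50, mean = 20, sd = 2), rnorm(50, mean = 100, sd = 5)),
  visits       = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 10, sd = 1))
)

# Scale the columns so neither dominates the distance, then cluster.
km <- kmeans(scale(customers), centers = 2, nstart = 10)
customers$segment <- km$cluster

# Segment sizes and the average profile of each segment.
print(table(customers$segment))
print(aggregate(cbind(annual_spend, visits) ~ segment, data = customers, FUN = mean))
```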
5 Libraries for Machine Learning
Allowing the machine to capture complexity:
1. gbm [Gradient Boosting Machine]
2. randomForest [Random Forest]
3. e1071 [Support Vector Machines]
Taking advantage of high-cardinality categorical or text data:
4. glmnet [Lasso and Elastic-Net Regularized Generalized Linear Models]
5. tau [Text Analysis Utilities]
http://cran.r-project.org/web/packages/gbm/
http://cran.r-project.org/web/packages/randomForest/
http://cran.r-project.org/web/packages/e1071/
http://cran.r-project.org/web/packages/glmnet/
http://cran.r-project.org/web/packages/Matrix/
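A short sketch of what using one of these libraries looks like, taking randomForest as the example. It assumes the package is installed (install.packages("randomForest")) and uses the built-in iris data. The wrapper fit_rf() and the ntree value are choices made for this illustration.

```r
# Fit a random forest classifier predicting Species from the other columns.
fit_rf <- function(data) {
  randomForest::randomForest(Species ~ ., data = data,
                             ntree = 200, importance = TRUE)
}

# Run only if the package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  fit <- fit_rf(iris)
  print(fit)                            # out-of-bag error and confusion matrix
  print(randomForest::importance(fit))  # which variables matter most
}
```

The other packages in the list follow a similar pattern: fit a model from a formula or a matrix, then inspect or predict from the fitted object.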
Big Data Integration
A single laptop is often sufficient: millions of rows on a 32 GB i7 laptop
Scale using a larger server: often sufficient but has limitations (100s of GB)
Clustered compute engine: the choice of algorithm affects performance
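One rough way to check whether a dataset will fit on a single machine is to measure a sample in memory and extrapolate. The sketch below builds a synthetic one-million-row data frame. The two columns and the 8 GB budget are arbitrary assumptions, and real workloads need headroom for the copies R makes during analysis.

```r
# A synthetic data frame with one million rows.
n <- 1e6
big <- data.frame(id = seq_len(n), value = rnorm(n))

# Size of the object in memory.
size_bytes <- as.numeric(object.size(big))
print(format(object.size(big), units = "MB"))

# Extrapolate: how many rows of this shape fit in a given memory budget?
bytes_per_row <- size_bytes / n
rows_in_8gb <- 8 * 1024^3 / bytes_per_row
print(round(rows_in_8gb))
```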
RServer
For datasets that don't fit in memory, or for convenience, there is a server option
A shared compute engine
Shares resources
Think 100+ GB of RAM
Big Data Integration - Frameworks
H2O.ai
SparkR
Revolution Analytics
In-DB processing
Applying a lead score or segmentation model in real time
Spark, Teradata, Vertica
Why R? In High Demand Nationally
Get Alton's FREE Reports!
Go to http://frontanalysis.com/bigdatameetup/
Complete the survey including your email
I'll email you the two reports:
1. Anonymized Summary of the Survey
2. LinkedIn Job Suggestions for a Utah Data Scientist