Transcript
Page 1: Data analysis in R for beginners

Data Analysis in R for Beginners

Alton Alexander

Data Science Consultant

Page 2: Data analysis in R for beginners

Why R?

• R is open source, like Python and unlike SAS

• Out of the box, R is a single-machine, in-memory statistical computing engine

– Download from https://www.r-project.org/

• Use an IDE

– RStudio https://www.rstudio.com/

– Revolution Analytics (MSFT)

– Jupyter (IPython)

Page 3: Data analysis in R for beginners

RStudio

Download

Overview

Page 4: Data analysis in R for beginners

Essential Learning Resources

A new book for learning R

Q: What have you tried and what works?

Page 5: Data analysis in R for beginners

Topics

• Data ingestion

• Manipulation

• Summary and exploration

• Writing reports

• Interactive visualization and dashboarding

• Predictive modeling & forecasting

• Big data integrations

Page 6: Data analysis in R for beginners

Demo

Options data

RStudio

Page 7: Data analysis in R for beginners

Data ingestion

• Load data

– read.csv()

– library(RJDBC)

– library(RODBC)
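As a minimal sketch of the flat-file path, the snippet below writes a tiny CSV to a temporary file and loads it with read.csv(). The file contents and column names (symbol, strike, volume) are made up for illustration and are not the options data from the demo. The RJDBC and RODBC routes need a database driver and connection details, so they are not shown here.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The column names (symbol, strike, volume) are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "symbol,strike,volume",
  "AAA,100,10",
  "AAA,105,20",
  "BBB,50,5"
), csv_path)

# read.csv() returns a data frame; stringsAsFactors = FALSE keeps text columns
# as character vectors instead of converting them to factors.
options_data <- read.csv(csv_path, stringsAsFactors = FALSE)

str(options_data)   # column names and types
head(options_data)  # first few rows
```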

Page 8: Data analysis in R for beginners

Data Structures and Manipulation

• Another major reason for using R

– Ability to work with data in data frames

– Like pandas in Python and data tables in SAS

• Reasons for doing data manipulation (munging)

– Feature extraction

– ETL

– Data cleansing

– Pivots, stack/unstack, aggregate, groupby, reshape
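A few of these manipulations can be sketched in base R on a small, made-up data frame. The data and column names below are hypothetical, and this is only one of several ways to do each step.

```r
# Hypothetical trade data; every name and value here is invented for illustration.
trades <- data.frame(
  symbol = c("AAA", "AAA", "BBB", "BBB", "BBB"),
  type   = c("call", "put", "call", "put", "call"),
  volume = c(10, 20, 5, 15, 25),
  stringsAsFactors = FALSE
)

# Feature extraction: derive a new column from an existing one
trades$is_call <- trades$type == "call"

# Data cleansing: keep only rows with no missing values
trades <- trades[complete.cases(trades), ]

# Group-by / aggregate: total volume per symbol and type (long format)
totals <- aggregate(volume ~ symbol + type, data = trades, FUN = sum)
totals

# Pivot: one row per symbol, one column per type (wide format)
wide <- xtabs(volume ~ symbol + type, data = trades)
wide
```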

Page 10: Data analysis in R for beginners

Summary and Exploration

• Powerful summary functions for programmatically quantifying datasets

• Functions include:

– summary(), hist(), levels(), aggregate()
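The four functions can be tried on the iris dataset that ships with R. That dataset is a stand-in chosen here so the sketch is self-contained; it is not the data used in the talk.

```r
# Per-column summaries: quartiles for numeric columns, counts for factors
summary(iris)

# Distinct categories of a factor column
levels(iris$Species)

# Histogram; plot = FALSE returns the bin counts without opening a graphics device
h <- hist(iris$Sepal.Length, plot = FALSE)
h$counts

# Group-wise statistic: mean sepal length per species
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
means
```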

Page 11: Data analysis in R for beginners

Interactive Visualization and Dashboarding

• Shiny from RStudio

• Like Tableau

– Local and server options

• Much more customizable, but more coding: no GUI or click-to-edit

• But you can bring in powerful libraries to build web apps comparatively fast
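To give a sense of the "more coding, no GUI" trade-off, here is a minimal Shiny sketch: a slider that controls a histogram. It assumes the shiny package is installed from CRAN, and it uses the built-in faithful dataset purely as a placeholder.

```r
library(shiny)  # assumes install.packages("shiny") has been run

# UI: a slider input and a placeholder for the plot
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider value changes.
# hist() treats a single number for 'breaks' as a suggested bin count.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch the app locally in a browser
```

The whole dashboard is two R objects, a UI definition and a server function, which is what makes it easy to pull in other R libraries for the plotting or modeling behind each output.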

Page 12: Data analysis in R for beginners

Predictive Modeling & Forecasting

• Examples

– Customer segmentation
• Unsupervised classification

– Marketing mix models
• Explain the coefficients

– Attribution modeling
• Supervised time series of events

– Multivariate testing
• A/B tests with statistical significance, ANOVA

– Lead scoring
• P2B models, topic of interest, propensity to buy, expected spend
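Several of these examples map onto functions in R's built-in stats package. The sketch below uses synthetic data throughout; the variable names, effect sizes, and the choice of k-means, linear regression, a t-test, and logistic regression are illustrative assumptions, not the models used in the talk.

```r
set.seed(42)

# Customer segmentation: k-means clustering on two synthetic features
customers <- data.frame(
  spend  = c(rnorm(50, 20, 5), rnorm(50, 80, 5)),
  visits = c(rnorm(50, 2, 1),  rnorm(50, 10, 1))
)
segments <- kmeans(scale(customers), centers = 2, nstart = 10)
table(segments$cluster)  # segment sizes

# Marketing mix: linear regression; each coefficient is read as the
# estimated change in sales per unit of spend on that channel
mix <- data.frame(tv = runif(100, 0, 10), search = runif(100, 0, 10))
mix$sales <- 5 + 2 * mix$tv + 0.5 * mix$search + rnorm(100)
mix_fit <- lm(sales ~ tv + search, data = mix)
coef(mix_fit)

# A/B test: two-sample t-test on a synthetic metric
a <- rnorm(200, mean = 10, sd = 2)
b <- rnorm(200, mean = 11, sd = 2)
ab <- t.test(a, b)
ab$p.value

# Lead scoring: logistic regression producing a propensity-to-buy score
leads <- data.frame(visits = rpois(300, 3))
leads$bought <- rbinom(300, 1, plogis(-2 + 0.6 * leads$visits))
lead_fit <- glm(bought ~ visits, data = leads, family = binomial)
leads$score <- predict(lead_fit, type = "response")  # probability between 0 and 1
head(leads)
```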

Page 13: Data analysis in R for beginners

5 Libraries for Machine Learning

Allowing the machine to capture complexity:

1. gbm [Gradient Boosting Machine]

2. randomForest [Random Forest]

3. e1071 [Support Vector Machines]

Taking advantage of high-cardinality categorical or text data:

4. glmnet [Lasso and Elastic-Net Regularized Generalized Linear Models]

5. tau [Text Analysis Utilities]
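As a rough illustration of how one of these libraries is called, the sketch below fits a random forest to the built-in iris dataset. The dataset and the ntree value are arbitrary choices made here, and the code is guarded so it still runs when the randomForest package is not installed.

```r
# randomForest is a CRAN package and may not be installed; the guard keeps
# this script runnable either way.
rf_fit <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  # Formula interface: predict Species from all other columns
  rf_fit <- randomForest::randomForest(Species ~ ., data = iris, ntree = 100)
  print(rf_fit)                            # out-of-bag error and confusion matrix
  print(randomForest::importance(rf_fit))  # which features the forest relied on
} else {
  message("Install with install.packages('randomForest') to run this sketch.")
}
```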

Page 14: Data analysis in R for beginners

Big Data Integration

• A single laptop is often sufficient

– Millions of rows on a 32 GB i7 laptop

• Scale using a larger server

– Often sufficient but has limitations (100s of GB)

• Clustered compute engine

– Algorithm choice affects performance
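One way to judge whether a laptop is enough is to measure the in-memory size of a sample and extrapolate. The sketch below assumes an all-numeric table with 10 columns, which is an invented layout; R stores each numeric value as an 8-byte double, so real tables with text or factor columns will differ.

```r
# Build a sample table: 100,000 rows by 10 numeric columns (hypothetical layout)
n_rows <- 1e5
n_cols <- 10
sample_df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))

# Measure the memory footprint of the sample
bytes <- as.numeric(object.size(sample_df))
bytes_per_row <- bytes / n_rows   # close to n_cols * 8 bytes

# Extrapolate to a larger table with the same column layout
target_rows <- 5e7
estimated_gb <- bytes_per_row * target_rows / 1024^3
estimated_gb
```

Keep in mind that many R operations copy their inputs, so the working set during analysis can be a multiple of the raw table size.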

Page 15: Data analysis in R for beginners

RServer

• For datasets that don’t fit in memory, or for convenience, there is a server option

– A shared compute engine

– Shares resources

– Think 100+ GB of RAM

Page 16: Data analysis in R for beginners

Big Data Integration - Frameworks

• H2O.ai

• SparkR

• Revolution Analytics

• In-DB processing

– Applying lead score or segmentation model in real time

– Spark, Teradata, Vertica

Page 17: Data analysis in R for beginners

Why R? In High Demand Nationally

Page 18: Data analysis in R for beginners

Get Alton’s FREE Reports!

Go to http://frontanalysis.com/bigdatameetup/

Complete the survey including your email

I’ll email you the two reports:

1. Anonymized Summary of the Survey

2. LinkedIn Job Suggestions for a Utah Data Scientist

