Data Analysis in R for Beginners

Post on 10-Jan-2017

Data Analysis in R for Beginners

Alton Alexander

Data Science Consultant

Why R?

• R is open source, like Python and unlike SAS
• Out of the box, R is a single-machine, in-memory statistical computing engine
– Download from https://www.r-project.org/
• Use an IDE
– RStudio https://www.rstudio.com/
– Revolution Analytics (Microsoft)
– Jupyter (IPython)

RStudio

Download

Overview

Essential Learning Resources

A new book for learning R

Q: What have you tried and what works?

Topics

• Data ingestion
• Manipulation
• Summary and exploration
• Writing reports
• Interactive visualization and dashboarding
• Predictive modeling & forecasting
• Big data integrations

Demo

Options data

RStudio

Data ingestion

• Load data
– read.csv()
– library(RJDBC)
– library(RODBC)
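A minimal sketch of the CSV route, using only base R. The file contents and column names (symbol, strike, type, price) are made up for illustration and are not the options data from the demo; the file is written to a temporary path so the example runs on its own.

```r
# Hypothetical options data, written to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "symbol,strike,type,price",
  "AAA,100,call,4.25",
  "AAA,100,put,3.10",
  "BBB,50,call,1.75"
), csv_path)

# read.csv() returns a data frame; keep text columns as character, not factor.
options_df <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect column types and the first rows.
str(options_df)
head(options_df)
```

For database sources, RJDBC and RODBC follow the same pattern: open a connection, run a query, and get a data frame back.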

Data Structures and Manipulation

• Another major reason for using R
– Ability to work with data in data frames
– Like pandas DataFrames in Python and data sets in SAS
• Reasons for doing data manipulation (munging)
– Feature extraction
– ETL
– Data cleansing
– Pivots, stack/unstack, aggregate, group by, reshape
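The manipulation steps listed above can be sketched with base R alone. The sales data frame below is invented for illustration; packages such as dplyr offer other ways to do the same thing.

```r
# Hypothetical sales data for illustration only.
sales <- data.frame(
  region  = c("east", "east", "west", "west", "west"),
  product = c("a", "b", "a", "b", "b"),
  units   = c(10, 5, 7, 3, 4),
  price   = c(2.0, 3.0, 2.0, 3.0, 3.0),
  stringsAsFactors = FALSE
)

# Feature extraction: derive a new column from existing ones.
sales$revenue <- sales$units * sales$price

# Data cleansing: drop rows with missing or non-positive unit counts.
clean <- sales[!is.na(sales$units) & sales$units > 0, ]

# Aggregate / group by: total revenue per region.
by_region <- aggregate(revenue ~ region, data = clean, FUN = sum)

# Pivot: regions as rows, products as columns, summed revenue in the cells.
pivot <- xtabs(revenue ~ region + product, data = clean)

print(by_region)
print(pivot)
```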

Summary and Exploration

• Powerful summary functions for programmatically quantifying datasets

• Functions include:
– summary(), hist(), levels(), aggregate()
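A short sketch of these four functions on the built-in iris dataset, which ships with R. The dataset is an assumption chosen for convenience; it is not the data used in the demo.

```r
# Five-number summary plus mean for numeric columns, counts for factors.
summary(iris)

# Histogram of one numeric column; plot = FALSE returns the bin counts
# without opening a graphics device.
h <- hist(iris$Sepal.Length, plot = FALSE)
h$counts

# Distinct categories of a factor column.
species_levels <- levels(iris$Species)
species_levels

# Mean sepal length per species.
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
means
```

Dropping plot = FALSE draws the histogram instead of only returning its counts.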

Interactive Visualization and Dashboarding

• Shiny from RStudio
• Like Tableau
– Local and server options
• Much more customizable, but more coding: no GUI or click-to-edit
• But you can bring in powerful libraries to build web apps comparatively fast
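To give a sense of how much code a Shiny dashboard takes, here is a minimal sketch. It assumes the shiny package is installed and uses the built-in faithful dataset; the input and output IDs ("bins", "hist") are arbitrary names picked for this example.

```r
library(shiny)  # assumes install.packages("shiny") has already been run

# UI: a slider that controls the number of histogram bins, and a plot area.
ui <- fluidPage(
  titlePanel("Histogram explorer"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider value changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruption durations", xlab = "Minutes")
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch the app only in an interactive session (for example, inside RStudio).
if (interactive()) runApp(app)
```

The same app object can be run locally or deployed to a Shiny server, which matches the local and server options mentioned above.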

Predictive Modeling & Forecasting

• Examples
– Customer segmentation: unsupervised classification
– Marketing mix models: explain the coefficients
– Attribution modeling: supervised time series of events
– Multivariate testing: A/B tests with statistical significance, ANOVA
– Lead scoring: propensity-to-buy (P2B) models, topic of interest, expected spend
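Two of the examples above, customer segmentation and A/B testing, can be sketched with base R's stats functions. The customer data and conversion counts are simulated for illustration, and k-means is only one of several clustering methods that could be used for segmentation.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated customers: a low-spend group and a high-spend group (hypothetical).
customers <- data.frame(
  annual_spend = c(rnorm(50, mean = 200, sd = 20), rnorm(50, mean = 1000, sd = 50)),
  visits       = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 12, sd = 1))
)

# Customer segmentation: k-means clustering on standardized features.
seg <- kmeans(scale(customers), centers = 2, nstart = 10)
customers$segment <- seg$cluster
table(customers$segment)

# A/B test: compare conversion rates of two variants (made-up counts:
# 120 of 1000 visitors converted on A, 150 of 1000 on B).
ab <- prop.test(x = c(120, 150), n = c(1000, 1000))
ab$p.value
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone; the threshold to act on is a business decision.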

5 Libraries for Machine Learning

Allowing the machine to capture complexity:
1. gbm [Gradient Boosting Machine]
2. randomForest [Random Forest]
3. e1071 [Support Vector Machines]

Taking advantage of high-cardinality categorical or text data:
4. glmnet [Lasso and Elastic-Net Regularized Generalized Linear Models]
5. tau [Text Analysis Utilities]
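As one illustration from this list, here is a minimal random forest sketch. It assumes the randomForest package is installed; the iris dataset, the 100-row training split, and ntree = 200 are arbitrary choices for the example, not recommendations from the talk.

```r
library(randomForest)  # assumes install.packages("randomForest") has been run

set.seed(1)  # reproducible train/test split

# Hold out 50 of the 150 iris rows for testing.
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit a classifier predicting species from the four measurements.
fit <- randomForest(Species ~ ., data = train, ntree = 200)

# Predict on held-out rows and compute the share classified correctly.
pred <- predict(fit, newdata = test)
accuracy <- mean(pred == test$Species)
accuracy

# Which variables the forest relied on most.
importance(fit)
```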

Big Data Integration

• Single laptop is often sufficient
– Millions of rows on a 32GB i7 laptop
• Scale using a larger server
– Often sufficient but has limitations (100s of GB)
• Clustered compute engine
– Algorithm considerations to affect performance
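A rough way to check whether a dataset will fit on one machine is to measure a sample and extrapolate. The sketch below uses a made-up data frame of one million rows and three numeric columns; the 8 GB budget is an arbitrary figure for illustration.

```r
n <- 1e6  # one million rows

# Three double-precision columns at 8 bytes per value.
df <- data.frame(a = runif(n), b = runif(n), c = runif(n))

# In-memory size of the data frame.
size_bytes <- object.size(df)
format(size_bytes, units = "MB")

# Extrapolate: how many such rows would fit in a hypothetical 8 GB budget?
bytes_per_row <- as.numeric(size_bytes) / n
rows_in_8gb   <- floor(8 * 1024^3 / bytes_per_row)
rows_in_8gb
```

Treat the result as an upper bound: many R operations copy their inputs, so working memory is often several times the size of the data itself.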

RServer

• For datasets that don’t fit in memory, or for convenience, there is a server option
– A shared compute engine
– Shares resources
– Think 100+ GB of RAM

Big Data Integration - Frameworks

• H2O.ai
• SparkR
• Revolution Analytics
• In-database processing
– Applying a lead score or segmentation model in real time
– Spark, Teradata, Vertica

Why R? In High Demand Nationally

Get Alton’s FREE Reports!

Go to http://frontanalysis.com/bigdatameetup/

Complete the survey including your email

I’ll email you the two reports:

1. Anonymized Summary of the Survey
2. LinkedIn Job Suggestions for a Utah Data Scientist
