a workshop on r

Post on 21-Apr-2017

Category:

Data & Analytics


Pre-Placement Workshop in R and Analytics

Delhi School of Economics 2014

Ajay Ohri

Hi, I am Ajay Ohri

Agenda

• Try and learn R in 12 hours
• Get an introduction to Analytics
• Be better skilled for Analytics as a career (?)

Training Plan

• DAY 1
– Session 1: 2.5 hours
– Session 2: 3.5 hours

• DAY 2
– Session 1: 2.5 hours
– Session 2: 3.5 hours

Instructor

• Author of R for Business Analytics
• Author of R for Cloud Computing (An approach for Data Scientists)
• 10+ years in Analytics and 6+ years in R
• Founder, Decisionstats.com

The Audience

Breakup – Demographics and Background

Expectations from each other

• From Instructor
– Your turn to speak

• From Audience
– Mobile phones should kindly be switched off
  • Yes, this includes WhatsApp
– Ask questions at end of session
– Take notes

Day 1 Session 1– Introductions

• Introduction to Analytics
• Introduction to R
• Interfaces in R
– Demos in R (Maths, Objects, etc.)
• Break 1
– Installation, Troubleshooting, Questions

Day 1 Session 2– Recap

• Input of Data
• Inspecting Data Quality
• Investigating Data Issues
– Demos in R (Data Input, Data Quality, Data Exploration)
• Break 2
– Questions

Day 2 Session 1– Revision

• Exploring Data
• Manipulating Data
• Visualization of Data
• Demos in R
– Data Exploration
– Data Manipulation
– Data Visualizations
• Break 1
– Questions

Day 2 Session 2– Recap

• Data Mining
• Regression Models
• Advanced Topics
• Demos in R
– Data Mining
– Model Building
– Advanced Topics
• Summary and Conclusion
• Break 2
– Questions

Analytics

• What is analytics?
– Study of data for helping with decision making using software
• Where is it used?
– Industries (like Pharma, BFSI, Telecom, Retail)
• How is it used?
– Use statistics and software
• What are some good practices?
– Learn one new thing extra from your competition every day. This is a fast moving field.
– Etc.

What is Data Science

Other Analytics Software

• SAS (Base) et al
• JMP
• SPSS
• Python
• Octave
• Clojure
• Julia (?)

R

What is R?
http://www.r-project.org/

• Language
– Object oriented
– Open Source
– Free
– Widely used

Object oriented programming is built around the concept of "objects" that have data fields (attributes that describe the object) and associated procedures known as methods. Objects, which are usually instances of classes, interact with one another to design applications and computer programs.

Pre Requisites

• Installation of R
http://cran.rstudio.com/bin/windows/base/
– Rtools
http://cran.rstudio.com/bin/windows/Rtools/

• R Studio
http://www.rstudio.com/products/rstudio/download/

• R Packages
– About eight packages are supplied with the R distribution, and many more are available through the CRAN family of Internet sites, covering a very wide range of modern statistics.
– install.packages(), update.packages(), library()
– Packages are installed once, updated periodically, but loaded every time
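The install once, update periodically, load every time cycle can be sketched as below. This is a minimal sketch: MASS is used only as an example package (it appears in the package list for this workshop), and the update call is left commented out because it contacts CRAN.

```r
# Install once (needs an Internet connection).
# requireNamespace() checks whether the package is already installed.
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}

# Update periodically. Commented out here because it contacts CRAN.
# update.packages(ask = FALSE)

# Load in every new R session before using the package.
library(MASS)
```

After library(MASS) the package shows up in the output of search().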

Pre Requisites

• R
• R Studio
• R Tools (for Windows)
• JAVA (JRE)
• R Packages (need Internet connection)
– Rcmdr
  • All packages asked at startup
  • Epack plugin
  • KMggplot2 plugin
– rattle
  • A few packages that are asked when using rattle
  • GTK+ (needs internet)
– Deducer
– ggmap
– Hmisc
– arules
– MASS

Interfaces to R

• Console
– Default
– Customization

• IDE

• GUI

Demo- Basic Math on R Console

• Operators: + - * / ()
• Functions: log, exp, mean, sum, sd, median

Hint- Function names are lower case in R (log, not Log)

Hint- Ctrl + L clears screen
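A short console session using the operators and functions from this demo. The vector x is made up for illustration.

```r
# Arithmetic with brackets
(3 + 5) * 2 / 4          # 4

# log() is the natural log, so it undoes exp()
log(exp(1))              # 1

# Summary functions on a small made-up vector
x <- c(2, 4, 6, 8, 10)
mean(x)                  # 6
sum(x)                   # 30
median(x)                # 6
sd(x)                    # square root of 10, about 3.162278
```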

Demo- Basic Objects on R Console

Hint- Up arrow gives you the last typed command

Functions-
• ls() lists the objects in the workspace
• rm("foo") removes the object named foo

Assignment- Using = or <- assigns a value to an object name (-> also works, with the value written on the left)
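A minimal sketch of assignment, listing and removal. The object names foo, bar and baz are placeholders.

```r
foo <- 10            # leftward assignment, the most common style
bar = "analytics"    # = also assigns at the top level
25 -> baz            # rightward assignment: value on the left, name on the right

ls()                 # lists the objects in the workspace, including foo, bar, baz

rm("foo")            # removes the object named foo; rm(foo) does the same
exists("foo", inherits = FALSE)   # FALSE once foo is removed
```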

Functions and Loops

• Loops
for (number in 1:5){ print (number) }

• Function
functionajay=function(a)(a^2+2*a+1)

Hint: Always match brackets

Each ( deserves a )
Each { deserves a }
Each [ deserves a ]
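The loop and the function from the slide, laid out over several lines so that each bracket can be matched by eye. The calls at the end are added for illustration.

```r
# Loop: print the numbers 1 to 5
for (number in 1:5) {
  print(number)
}

# Function: returns a^2 + 2a + 1, which equals (a + 1)^2
functionajay <- function(a) {
  a^2 + 2 * a + 1
}

functionajay(3)      # 16
functionajay(1:3)    # 4 9 16, because R applies the arithmetic to each element
```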

Demo- Basic Objects on R Console

Functions-
• class() gives class
• dim() gives dimensions
• nrow() gives rows
• ncol() gives columns
• length() gives length
• str() gives structure
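These functions can be tried on the built-in mtcars dataset, which is also used later in the workshop.

```r
data(mtcars)

class(mtcars)         # "data.frame"
dim(mtcars)           # 32 11 (rows, columns)
nrow(mtcars)          # 32
ncol(mtcars)          # 11
length(mtcars)        # 11: for a data frame, length counts columns
length(mtcars$mpg)    # 32: for a vector, length counts elements
str(mtcars)           # one line per column with its type and first values
```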

Demo- Datasets on R Console

Hint- use data() to list the datasets available in the loaded packages
Hint- library(FOO) loads package "FOO"

R- Basic Functions

– ls()
– rm()
– str()
– summary()
– getwd()
– setwd()
– dir()
– read.csv()
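A sketch that strings these basic functions together. It works inside R's temporary directory so that no real folder is touched; the file name scores.csv and its contents are made up for illustration.

```r
old_dir <- getwd()        # remember the current working directory
setwd(tempdir())          # move to a scratch directory

# Create a small CSV file so there is something to read
writeLines(c("id,score", "1,82", "2,75", "3,91"), "scores.csv")

dir()                     # lists files in the working directory, including scores.csv

scores <- read.csv("scores.csv")
str(scores)               # structure: 3 observations of 2 variables
summary(scores)           # min, quartiles, mean and max for each column

ls()                      # the workspace now contains scores (and old_dir)
n_rows <- nrow(scores)
rm(scores)                # remove the data frame from the workspace

setwd(old_dir)            # go back to where we started
```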

Day 1 Session 2– Recap

read.table()

Statistical formats

• read.spss from foreign package
• read.sas7bdat from sas7bdat package

From Databases

The RODBC package provides access to databases through an ODBC interface.

The primary functions are
• odbcConnect(dsn, uid="", pwd="") Open a connection to an ODBC database
• sqlFetch(channel, sqltable) Read a table from an ODBC database into a data frame
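The two RODBC calls can be wrapped as below. This is only a sketch: fetch_odbc_table is a hypothetical helper name, and the DSN and table in the commented usage line are placeholders that must already exist on your machine. The function is defined here but not run.

```r
# Hypothetical helper: open an ODBC connection, read one table, close again.
fetch_odbc_table <- function(dsn, sqltable, uid = "", pwd = "") {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("Please install the RODBC package first")
  }
  channel <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  # odbcConnect returns -1 (not a connection object) when it fails
  if (!inherits(channel, "RODBC")) {
    stop("Could not connect to DSN: ", dsn)
  }
  on.exit(RODBC::odbcClose(channel))   # always close the connection
  RODBC::sqlFetch(channel, sqltable)   # returns a data frame
}

# Usage (placeholders, assumes a DSN called "mydsn" with a table "customers"):
# customers <- fetch_odbc_table("mydsn", "customers")
```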

Hint- a good site to learn R http://www.statmethods.net

A Detour to SQL

From Web (aka Web Scraping)

• readLines

Hint: R is case sensitive. readlines is not the same as readLines

Hint: Use head() and tail() to inspect objects

Other packages are XML and Curl
Case Study- http://decisionstats.com/2013/04/14/using-r-for-cricket-analysis-rstats/
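A small readLines sketch. To keep it runnable without an Internet connection it reads a made-up local HTML file; passing a URL string to readLines instead of the file path reads a web page in the same way.

```r
# Write a tiny made-up page to a temporary file
page_file <- tempfile(fileext = ".html")
writeLines(c("<html>",
             "<title>Scores</title>",
             "<p>India 250</p>",
             "</html>"), page_file)

page <- readLines(page_file)       # one character string per line

head(page, 2)                      # first two lines
tail(page, 1)                      # last line
grep("India", page, value = TRUE)  # lines that mention India
```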

Inspecting Data Quality

• head()
• tail()
• names()
• str()
• objectname[l,m]
• objectname$variable

Hint- Try this code please

data(mtcars)
head(mtcars,10)
tail(mtcars,5)
names(mtcars)
str(mtcars)
mtcars[1,]
mtcars[,2]
mtcars[2,3]
mtcars$cyl

Inspecting Data Quality: Demo

Data Selection

• object[l,m] gives the value in row l and column m
• object[l,] gives all the values in row l
• object$varname gives all values of varname
• subset helps in selection

Data Selection: Demo

Questions-
• How do I use multiple conditions (AND, OR)?
• Can I do away with the subset function?
• How do I select a random sample?
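One possible set of answers to these questions, using mtcars. The cut-off values and the sample size of 5 are arbitrary choices for illustration.

```r
data(mtcars)

# Multiple conditions: & is AND, | is OR
small_and_frugal <- subset(mtcars, cyl == 4 & mpg > 30)
small_or_strong  <- subset(mtcars, cyl == 4 | hp > 200)

# The same AND selection without subset(), using a logical row index
same_without_subset <- mtcars[mtcars$cyl == 4 & mtcars$mpg > 30, ]

# Random sample of 5 rows; set.seed() makes the draw repeatable
set.seed(42)
random_rows <- mtcars[sample(nrow(mtcars), 5), ]
random_rows
```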

Useful Link- http://decisionstats.com/2013/11/24/50-functions-to-clear-a-basic-interview-for-business-analytics-rstats/

Day 2 Session 1– Revision

Good coding practices

• Use # for comments
• Use git for version control
• Use RStudio for multiple lines of code

Functions in R

• Custom functions
• Source code for a function
• Understanding help: ? and ??

Packages in R

• CRAN
• CRAN Views
• R Documentation

Documentation in R

• Help: ? and ??
• CRAN Views
• Package Help
• Tips for Googling
– Stack Overflow
– Email Lists
– Twitter
– R Bloggers

Interfaces to R

• Console

• IDE
– R Studio

• GUI
– Graphical User Interface

Graphical Interfaces to R

• R Commander

• Rattle

• Deducer

Installation of R Commander

Overview of R Commander

Demo R Commander – 3D Graphs

Installation of Rattle

• GTK+ Installation Necessary

• Install other packages when prompted

Overview of Rattle

Demo Rattle

Installation Deducer (with JGR)

Overview of Deducer (with JGR)

Demo Deducer

• data()
• data(mtcars)

Data Exploration

• summary()
• table()
• describe() (Hmisc)
• summarize() (Hmisc)

Hint- Try this code please

summary(mtcars)
table(mtcars$cyl)

library(Hmisc)
describe(mtcars)

summarize(mtcars$mpg,mtcars$cyl,mean)

CLASS WORK-
• Use table command for two variables
• Summarize mtcars$mpg for two variables (cyl, gear)
• Try and find min and max for the same
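One possible way to do the class work, sketched in base R so that it runs without Hmisc. Using aggregate() here is a substitution for the Hmisc summarize() shown on the slide, not the slide's own method.

```r
data(mtcars)

# Two-way table: counts of cars for each cyl and gear combination
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)
cyl_by_gear

# Mean mpg for each cyl and gear combination
mean_mpg <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)

# Min and max mpg for the same groups
min_mpg <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = min)
max_mpg <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = max)

mean_mpg
```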

Data Exploration

• Missing values are represented by NA in R
• Demo
– is.na
– na.omit
– na.rm
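A minimal sketch of the three tools on a made-up vector with one missing value.

```r
marks <- c(10, NA, 30)

is.na(marks)               # FALSE TRUE FALSE: flags the missing value
sum(is.na(marks))          # 1: number of missing values

mean(marks)                # NA: one missing value makes the result missing
mean(marks, na.rm = TRUE)  # 20: na.rm = TRUE drops NA before computing

clean <- na.omit(marks)    # removes the missing entries altogether
length(clean)              # 2
```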

Data Visualization

Notes-
• Explaining Basic Types of Graphs
• Customizing Graphs
• Graph Output
• Advanced Graphs
• Facets, Grammar of Graphics
• Data Visualization Rules

Data Manipulation Demo

Notes-
1. gsub
2. gsub with escape
3. as operator
4. is operator
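The four notes can be sketched on made-up strings as below. The values are hypothetical; the pattern is the typical clean-up of numbers that were read in as text.

```r
# 1. gsub replaces every match of a pattern
sales <- gsub(",", "", "1,234")     # "1234"

# 2. gsub with escape: $ is special in patterns, so escape it with \\
price <- gsub("\\$", "", "$100")    # "100"

# 3. as functions convert from one type to another
price_num <- as.numeric(price)      # 100

# 4. is functions test the type
is.character(price)                 # TRUE
is.numeric(price_num)               # TRUE
```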

Text Manipulation

Functions-
• nchar
• substr
• paste
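A quick sketch of the three text functions on a sample string.

```r
venue <- "Delhi School"

nchar(venue)               # 12: number of characters, including the space
substr(venue, 1, 5)        # "Delhi": characters 1 to 5
paste("R", "Analytics")    # "R Analytics": joined with a space by default
paste("R", "Analytics", sep = "-")   # "R-Analytics"
```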

Date Manipulation

Hit escape to escape the + signs
+ signs occur due to unclosed quotes or brackets

Use ? help generously

Class Work
• What is your age in days as of today?
• What is your age in weeks as of today?

Hint-
> age2=difftime(Sys.Date(),dob2,units='weeks')
> age2
Time difference of 1959.286 weeks
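A sketch of the class work with fixed dates so the result is repeatable. The date of birth and the "today" value are made up; replace today with Sys.Date() to get your real age.

```r
dob   <- as.Date("1990-01-01")     # hypothetical date of birth
today <- as.Date("2014-01-01")     # fixed date; use Sys.Date() for the real today

age_days  <- difftime(today, dob, units = "days")
age_weeks <- difftime(today, dob, units = "weeks")

age_days                 # Time difference of 8766 days
as.numeric(age_weeks)    # the same span in weeks, as a plain number
```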

Data Output

• Graphical Output
• Numerical Output (aggregation)

Data Output

• Use objects to summarize
• Use write.csv
• Use setwd() to set location of output
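A sketch of both kinds of output. It writes into R's temporary directory with full paths instead of calling setwd(); the file names are made up.

```r
data(mtcars)

# Numerical output: store the summary in an object, then write it out
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
csv_file <- file.path(tempdir(), "mpg_by_cyl.csv")
write.csv(mpg_by_cyl, csv_file, row.names = FALSE)

# Graphical output: open a PDF device, draw, then close the device
plot_file <- file.path(tempdir(), "mpg_hist.pdf")
pdf(plot_file)
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
invisible(dev.off())
```

With setwd() pointing at your output folder, plain file names such as "mpg_by_cyl.csv" would land there instead.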

Econometrics

Coming up: Regression

Correlation

Regression

Notes-
• Correlation is not causation
• How do we determine which is the dependent and which are the independent variables?

Regression

Regression using R Commander

Lies True Lies and Statistics

• Anscombe -case study
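The Anscombe case study can be reproduced with the anscombe dataset that ships with R. A minimal sketch: the four x, y pairs share almost identical summary statistics, yet the plots look completely different.

```r
data(anscombe)

# Correlation of each x, y pair
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(cors, 2)          # 0.82 0.82 0.82 0.82

# Means of the y columns are also nearly identical
y_means <- colMeans(anscombe[, c("y1", "y2", "y3", "y4")])
round(y_means, 1)

# Plot the four pairs side by side to see how different they really are
pdf(file.path(tempdir(), "anscombe.pdf"))
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
invisible(dev.off())
```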

Regression Recap

• cor
• lm
• anova
• summary and plot of lm object
• residuals
• p value
– vif
– heteroskedasticity
– outliers
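A sketch of the recap on mtcars. The choice of mpg as the dependent variable and wt and hp as predictors is only for illustration. The VIF is computed by hand from its definition, 1 / (1 - R squared), instead of calling an add-on package.

```r
data(mtcars)

# Correlation between mileage and weight
cor_mpg_wt <- cor(mtcars$mpg, mtcars$wt)

# Linear model with two predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)            # coefficients, p values, R squared
anova(fit)              # sequential sums of squares
head(residuals(fit))    # observed minus fitted values

# Diagnostic plots (residuals, QQ plot and so on) written to a PDF
pdf(file.path(tempdir(), "lm_diagnostics.pdf"))
par(mfrow = c(2, 2))
plot(fit)
invisible(dev.off())

# VIF for wt by hand: regress wt on the other predictor
r2_wt  <- summary(lm(wt ~ hp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt
```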

Propensity Modeling in Industry

• Response Rates
• Lift
• Test and Control groups

Day 2 Session 2– Recap

Data Mining

• Rattle
– association analysis
– cluster analysis
– modeling

Rattle

• Analyze wine

Rattle

• Cluster Analysis

Data Mining

• Brief Introduction

– Affinity analysis is a data analysis and data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling.

Rattle

• Brief Introduction
– market basket analysis
– Market basket analysis might tell a retailer that customers often purchase shampoo and conditioner together, so putting both items on promotion at the same time would not create a significant increase in revenue, while a promotion involving just one of the items would likely drive sales of the other.

Rattle

• Brief Introduction
– association rules
– if butter and bread are bought, customers also buy milk

Example database with 4 items and 5 transactions

transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

Rattle

• Brief Introduction
– association rules
– the rule (milk, bread -> butter) has a support of 20% since milk, bread and butter occur together in 20% of all transactions (1 out of 5 transactions).
– the rule (milk, bread -> butter) has a confidence of 50% since butter occurs in 50% of the transactions that contain milk and bread (1 out of 2 transactions).
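The support and confidence figures can be checked by hand in base R using the 5 transaction example database. This is a sketch of the arithmetic only; in practice the arules package listed in the pre-requisites is the usual tool for mining rules.

```r
# Example database: one row per transaction, 1 means the item was bought
transactions <- data.frame(
  milk   = c(1, 0, 0, 1, 0),
  bread  = c(1, 0, 0, 1, 1),
  butter = c(0, 1, 0, 1, 0),
  beer   = c(0, 0, 1, 0, 0)
)

# Left-hand side of the rule: milk and bread bought together
has_lhs  <- transactions$milk == 1 & transactions$bread == 1

# Whole rule: milk, bread and butter bought together
has_rule <- has_lhs & transactions$butter == 1

# Support: share of all transactions that contain the whole rule
support <- sum(has_rule) / nrow(transactions)     # 1 / 5 = 0.2

# Confidence: share of the left-hand-side transactions that also contain butter
confidence <- sum(has_rule) / sum(has_lhs)        # 1 / 2 = 0.5

support
confidence
```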

Regression Models

• lm function
• Understanding output
• Diagnostics
– Homoskedasticity
– Multicollinearity
– p value
– Residuals

Advanced Topics: Demos

• Time Series Analysis (use epack plugin)
http://decisionstats.com/2010/10/22/doing-time-series-using-a-r-gui/

• Advanced Data Visualization (kmggplot2 plugin)
http://decisionstats.com/2012/05/21/new-rcommander-with-ggplot-rstats/

• Social Network Analysis (sna)
– Facebook
http://decisionstats.com/2014/05/10/analyzing-facebook-networks-using-rstats/
– Twitter
http://www.slideshare.net/ajayohri/twitter-analysis-by-kaify-rais

• Spatial Analysis
– ggmap demo
http://decisionstats.com/2013/08/19/the-wonderful-ggmap-package-for-spatial-analysis-in-r-rstats/
– rmaps
http://rcharts.io/viewer/?9223554#.Uw4hOPmSySp

Thank You

• http://linkedin.com/in/ajayohri
• ohri2007@gmail.com
