a workshop on r
Pre-Placement Workshop in R and Analytics
Delhi School of Economics 2014
Ajay Ohri
Hi, I am Ajay Ohri
Agenda
• Try and learn R in 12 hours
• Get an introduction to Analytics
• Be better skilled for Analytics as a career (?)
Training Plan
• DAY 1
– Session 1 - 2.5 hours
– Session 2 - 3.5 hours
• DAY 2
– Session 1 - 2.5 hours
– Session 2 - 3.5 hours
Instructor
• Author of R for Business Analytics
• Author of R for Cloud Computing (An Approach for Data Scientists)
• 10+ yrs in Analytics and 6+ years in R
• Founder, Decisionstats.com
The Audience
Breakup – Demographics and Background
Expectations from each other
• From Instructor
– Your turn to speak
• From Audience
– Mobile phones should be kindly switched off
• Yes, this includes WhatsApp
– Ask Questions at end of session
– Take Notes
Day 1 Session 1 – Introductions
• Introduction to Analytics
• Introduction to R
• Interfaces in R
– Demos in R (Maths, Objects, etc.)
• Break 1
– Installation, Troubleshooting, Questions
Day 1 Session 2 – Recap
• Input of Data
• Inspecting Data Quality
• Investigating Data Issues
– Demos in R (Data Input, Data Quality, Data Exploration)
• Break 2
– Questions
Day 2 Session 1 – Revision
• Exploring Data
• Manipulating Data
• Visualization of Data
• Demos in R
– Data Exploration, Data Manipulation, Data Visualization
• Break 1
– Questions
Day 2 Session 2 – Recap
• Data Mining
• Regression Models
• Advanced Topics
• Demos in R
– Data Mining, Model Building, Advanced Topics
• Summary and Conclusion
• Break 2
– Questions
Analytics
• What is analytics?
– Study of data for helping with decision making using software
• Where is it used?
– Industries (like Pharma, BFSI, Telecom, Retail)
• How is it used?
– Use statistics and software
• What are some good practices?
– Learn one new thing extra from your competition every day. This is a fast moving field.
– Etc.
What is Data Science
Other Analytics Software
• SAS (Base) et al
• JMP
• SPSS
• Python
• Octave
• Clojure
• Julia (?)
R
What is R? http://www.r-project.org/
• Language
– Object oriented
– Open Source
– Free
– Widely used
Object-oriented programming is built on the concept of "objects" that have data fields (attributes that describe the object) and associated procedures known as methods. Objects, which are usually instances of classes, interact with one another to design applications and computer programs.
Pre Requisites
• Installation of R http://cran.rstudio.com/bin/windows/base/
– Rtools (for Windows) http://cran.rstudio.com/bin/windows/Rtools/
• R Studio http://www.rstudio.com/products/rstudio/download/
• JAVA (JRE)
• R Packages (need Internet connection)
– About eight packages are supplied with the R distribution, and many more are available through the CRAN family of Internet sites, covering a very wide range of modern statistics.
– install.packages(), update.packages(), library()
– Packages are installed once, updated periodically, but loaded every time
– Rcmdr
• All packages asked at startup
• Epack plugin
• KMggplot2 plugin
– rattle
• A few packages that are asked when using rattle
• GTK+ (needs internet)
– Deducer
– ggmap
– Hmisc
– arules
– MASS
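Notes- A minimal sketch of the "install once, load every time" idea. MASS is used only as an illustration because it usually ships with R; the CRAN mirror URL is an assumption and any mirror would do.

```r
# Install once (needs Internet); skipped if the package is already present.
# MASS is only an illustrative choice from the package list.
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS", repos = "https://cloud.r-project.org")
}

# Load in every new R session before using the package.
library(MASS)

# Update periodically (commented out because it contacts CRAN):
# update.packages(ask = FALSE)
```

The same pattern applies to Rcmdr, rattle, Deducer and the other packages listed.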
Interfaces to R
• Console
– Default
– Customization
• IDE
• GUI
Demo- Basic Math on R Console
• + • - • * • / • ()
• log • exp
• mean • sum • sd • median
Hint- Ctrl+L clears screen
Demo- Basic Objects on R Console
Hint- Up arrow gives you the last typed command
Functions- ls() lists what objects are here; rm("foo") removes the object named foo
Assignment- Using = or <- assigns values to object names
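Notes- A short console session along the lines of this demo. The vector marks is hypothetical data made up for illustration.

```r
# Arithmetic works like a calculator
2 + 3 * (4 - 1)      # 11
log(exp(1))          # 1

# Assign a vector to a name, then summarise it
marks <- c(10, 20, 30, 40)   # hypothetical data
mean(marks)          # 25
sum(marks)           # 100
median(marks)        # 25
sd(marks)

ls()                 # lists the objects in the workspace
foo = 5              # = also assigns at the top level
rm("foo")            # removes the object named foo
```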
Functions and Loops
• Loops: for (number in 1:5) { print(number) }
• Function: functionajay = function(a) (a^2 + 2*a + 1)
Hint: Always match brackets
Each ( deserves a )
Each { deserves a }
Each [ deserves a ]
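Notes- The same function and loop written out over several lines, as a sketch. The braces and the vectorised call at the end are additions for illustration.

```r
# The slide's function, written with braces; a^2 + 2*a + 1 equals (a + 1)^2
functionajay <- function(a) {
  a^2 + 2 * a + 1
}

functionajay(3)      # 16

# Call the function inside a loop
for (number in 1:5) {
  print(functionajay(number))
}

# Most R functions are vectorised, so the loop can often be avoided
functionajay(1:5)    # 4 9 16 25 36
```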
Demo- Basic Objects on R Console
Functions-
class() gives class
dim() gives dimensions
nrow() gives rows
ncol() gives columns
length() gives length
str() gives structure
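Notes- These functions can be tried on the built-in mtcars dataset. A minimal sketch:

```r
data(mtcars)
class(mtcars)        # "data.frame"
dim(mtcars)          # 32 11
nrow(mtcars)         # 32
ncol(mtcars)         # 11
length(mtcars)       # 11, for a data frame length() counts columns
length(mtcars$mpg)   # 32
str(mtcars)
```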
Demo- Datasets on R Console
Hint- use data() to list the datasets available in loaded packages
library(FOO) loads package "FOO"
R- Basic Functions
– ls()
– rm()
– str()
– summary()
– getwd()
– setwd()
– dir()
– read.csv()
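Notes- A sketch of how these functions fit together when reading a file. The file students.csv and its contents are hypothetical; it is written to a temporary folder so the example does not depend on any file on your machine.

```r
getwd()                               # current working directory

# Write a small hypothetical file to a temporary folder, then read it back
tmp <- file.path(tempdir(), "students.csv")
writeLines(c("name,marks", "Asha,82", "Ravi,75"), tmp)

dir(tempdir())                        # the new file should be listed
students <- read.csv(tmp)
str(students)
summary(students$marks)
```

With your own data, setwd() to the folder that holds the file and pass the file name to read.csv().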
Day 1 Session 2 – Recap
• Input of Data
• Inspecting Data Quality
• Investigating Data Issues
– Demos in R (Data Input, Data Quality, Data Exploration)
• Break 2
– Questions
read.table()
Statistical formats
• read.spss from foreign package
• read.sas7bdat from sas7bdat package
From Databases
The RODBC package provides access to databases through an ODBC interface.
The primary functions are
• odbcConnect(dsn, uid="", pwd="") Open a connection to an ODBC database
• sqlFetch(channel, sqltable) Read a table from an ODBC database into a data frame
Hint- a good site to learn R http://www.statmethods.net
A Detour to SQL
From Web (aka Web Scraping)
• readLines
Hint: R is case sensitive, readlines is not the same as readLines
Hint: Use head() and tail() to inspect objects
Other packages are XML and RCurl
Case Study- http://decisionstats.com/2013/04/14/using-r-for-cricket-analysis-rstats/
Inspecting Data Quality
• head()
• tail()
• names()
• str()
• objectname[l,m]
• objectname$variable
Hint- Try this code please
data(mtcars)
head(mtcars,10)
tail(mtcars,5)
names(mtcars)
str(mtcars)
mtcars[1,]
mtcars[,2]
mtcars[2,3]
mtcars$cyl
Inspecting Data Quality: Demo
Data Selection
• object[l,m] gives the value in row l and column m
• object[l,] gives all the values in row l
• object$varname gives all values of varname
• subset helps in selection
Data Selection: Demo
Questions-
How do I use multiple conditions (AND, OR)?
Can I do away with the subset function?
How do I select a random sample?
Useful Link- http://decisionstats.com/2013/11/24/50-functions-to-clear-a-basic-interview-for-business-analytics-rstats/
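Notes- One possible way to answer the three questions, sketched on mtcars. The thresholds, the sample size of 5 and the seed are arbitrary choices for illustration.

```r
data(mtcars)

# Multiple conditions: & is AND, | is OR
subset(mtcars, cyl == 4 & mpg > 30)
subset(mtcars, cyl == 8 | hp > 200)

# Without subset(): logical indexing inside [rows, columns]
mtcars[mtcars$cyl == 4 & mtcars$mpg > 30, c("mpg", "cyl")]

# Random sample of 5 rows; set.seed() makes the draw repeatable
set.seed(2014)
picked <- mtcars[sample(nrow(mtcars), 5), ]
picked
```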
Day 2 Session 1 – Revision
• Exploring Data
• Manipulating Data
• Visualization of Data
• Demos in R
– Data Exploration, Data Manipulation, Data Visualization
• Break 1
– Questions
Good coding practices
• Use # for comments
• Use git for version control
• Use RStudio for multiple lines of code
Functions in R
• custom functions
• source code for a function
• Understanding help ? , ??
Packages in R
• CRAN
• CRAN Task Views
• R Documentation
Documentation in R
• Help ? and ??
• CRAN Task Views
• Package Help
• Tips for Googling
– Stack Overflow
– Email Lists
– Twitter
– R Bloggers
Interfaces to R
• Console
• IDE: R Studio
• GUI: Graphical User Interface
Graphical Interfaces to R
• R Commander
• Rattle
• Deducer
Installation of R Commander
Overview of R Commander
Demo: R Commander – 3D Graphs
Installation of Rattle
• GTK+ Installation Necessary
• Install other packages when prompted
Overview of Rattle
Demo Rattle
Installation Deducer (with JGR)
Overview of Deducer (with JGR)
Demo Deducer
• data()
• data(mtcars)
Data Exploration
• summary()
• table()
• describe() (Hmisc)
• summarize() (Hmisc)
Hint- Try this code please
summary(mtcars)
table(mtcars$cyl)
library(Hmisc)
describe(mtcars)
summarize(mtcars$mpg,mtcars$cyl,mean)
CLASS WORK-
• Use table command for two variables
• Summarize mtcars$mpg for two variables (cyl, gear)
• Try and find min and max for the same
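Notes- One possible approach to the class work, sketched with base R only. It uses aggregate() rather than Hmisc's summarize(), so nothing extra needs to be installed; that substitution is ours, not part of the original exercise.

```r
data(mtcars)

# Two-way frequency table of cylinders by gears
tab <- table(mtcars$cyl, mtcars$gear)
tab

# Mean, min and max of mpg for each cyl/gear combination
mpg_mean <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)
mpg_min  <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = min)
mpg_max  <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = max)
mpg_mean
```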
Data Exploration
• missing values are represented by NA in R
• Demo
– is.na
– na.omit
– na.rm
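Notes- A minimal sketch of the three tools on a small hypothetical vector.

```r
x <- c(4, NA, 7, 9)        # hypothetical vector with one missing value

is.na(x)                   # FALSE TRUE FALSE FALSE
sum(is.na(x))              # 1 missing value
mean(x)                    # NA, because the missing value propagates
mean(x, na.rm = TRUE)      # 6.666667
na.omit(x)                 # drops the NA
```

Note that na.rm is an argument of functions such as mean() and sum(), not a function of its own.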
Data Visualization
Notes-
Explaining Basic Types of Graphs
Customizing Graphs
Graph Output
Advanced Graphs
Facets, Grammar of Graphics
Data Visualization Rules
Data Manipulation Demo
Notes-
1. gsub
2. gsub with escape
3. as operator
4. is operator
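Notes- A sketch of the four items on hypothetical price strings. Reading "as operator" and "is operator" as the as.* and is.* family of functions is our interpretation.

```r
prices <- c("$1,200", "$350")       # hypothetical text data

# gsub(pattern, replacement, x) replaces every match
no_comma <- gsub(",", "", prices)   # "$1200" "$350"

# $ is a special regex character, so escape it (or use fixed = TRUE)
clean <- gsub("\\$", "", no_comma)  # "1200" "350"

# as.* functions convert; is.* functions test
is.character(clean)                 # TRUE
amount <- as.numeric(clean)
is.numeric(amount)                  # TRUE
sum(amount)                         # 1550
```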
Text Manipulation
Functions-
nchar
substr
paste
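Notes- A quick sketch of the three functions on a sample string chosen for illustration.

```r
city <- "New Delhi"
nchar(city)                       # 9
substr(city, 1, 3)                # "New"
paste("Hello", city)              # "Hello New Delhi"
paste("R", 1:3, sep = "-")        # "R-1" "R-2" "R-3"
```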
Date Manipulation
Hit escape to escape the + signs
+ signs occur due to unclosed quotes or brackets
Use ? help generously
Class Work
What is your age in days as of today?
What is your age in weeks as of today?
Hint-
> age2=difftime(Sys.Date(),dob2,units='weeks')
> age2
Time difference of 1959.286 weeks
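Notes- A sketch of the same calculation with both dates fixed, so the result does not change from day to day. Both dates are hypothetical; for the class work, put your own date of birth in dob and use Sys.Date() for today.

```r
# Hypothetical dates, fixed so the result is repeatable
dob   <- as.Date("1990-01-15")
today <- as.Date("2014-01-15")

age_days  <- difftime(today, dob, units = "days")
age_weeks <- difftime(today, dob, units = "weeks")

age_days                      # Time difference of 8766 days
as.numeric(age_weeks)         # weeks as a plain number
```

as.Date() expects text in year-month-day order by default; other layouts need the format argument.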
Data Output
• Graphical Output
• Numerical Output (aggregation)
• Use objects to summarize
• Use write.csv
• Use setwd() to set location of output
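Notes- A minimal sketch of numerical output: summarise into an object, then write it to a CSV file. The file name is hypothetical, and a temporary folder is used here in place of setwd() so the example leaves no files behind.

```r
data(mtcars)

# Summarise into an object first
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Write to a temporary folder (hypothetical file name);
# in practice use setwd() or a full path to choose the location
out_file <- file.path(tempdir(), "mpg_by_cyl.csv")
write.csv(mpg_by_cyl, out_file, row.names = FALSE)

read.csv(out_file)   # read it back to check
```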
Econometrics
Coming up: Regression
Correlation
Regression
Notes-
Correlation is not causation
How do we determine which is the dependent variable and which are the independent variables?
Regression
Regression using R Commander
Lies True Lies and Statistics
• Anscombe -case study
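Notes- A sketch of the Anscombe case study using the anscombe dataset that ships with R. The four x/y pairs give almost the same regression line, yet look completely different when plotted. The plotting part runs only in an interactive session.

```r
data(anscombe)

# Fit y1 ~ x1, ..., y4 ~ x4 and collect intercept and slope
fits <- sapply(1:4, function(i) {
  coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
})
rownames(fits) <- c("intercept", "slope")
round(fits, 2)        # all four are close to 3.00 and 0.50

# The plots tell a different story
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  for (i in 1:4) {
    plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
         xlab = paste0("x", i), ylab = paste0("y", i))
  }
  par(op)
}
```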
Regression Recap
• cor
• lm
• anova
• summary and plot of lm object
• residuals
• p value
– vif
– heteroskedasticity
– outliers
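Notes- A sketch that walks through the recap on mtcars. The choice of mpg as the dependent variable and wt and hp as predictors is only for illustration. vif() is not in base R; the commented line assumes the car package is installed.

```r
data(mtcars)

cor(mtcars$mpg, mtcars$wt)            # strong negative correlation

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)                          # coefficients, p values, R-squared
anova(fit)
head(residuals(fit))

# Diagnostic plots (residuals vs fitted, Q-Q, etc.) help spot
# heteroskedasticity and outliers
if (interactive()) plot(fit)

# Variance inflation factors, assuming the car package is installed:
# car::vif(fit)
```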
Propensity Modeling in Industry
• Response Rates
• Lift
• Test and Control groups
Day 2 Session 2 – Recap
• Data Mining
• Regression Models
• Advanced Topics
• Demos in R
– Data Mining, Model Building, Advanced Topics
• Summary and Conclusion
• Break 2
– Questions
Data Mining
• Rattle
– association analysis
– cluster analysis
– modeling
Rattle
• Analyze wine
Rattle
• Cluster Analysis
Data Mining
• Brief Introduction
– Affinity analysis is a data analysis and data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling.
Rattle
• Brief Introduction
– market basket analysis
– Market basket analysis might tell a retailer that customers often purchase shampoo and conditioner together, so putting both items on promotion at the same time would not create a significant increase in revenue, while a promotion involving just one of the items would likely drive sales of the other.
Rattle
• Brief Introduction
– association rules
– if butter and bread are bought, customers also buy milk
Example database with 4 items and 5 transactions
transaction ID  milk  bread  butter  beer
1               1     1      0       0
2               0     0      1       0
3               0     0      0       1
4               1     1      1       0
5               0     1      0       0
Rattle
• Brief Introduction
– association rules
– the rule (milk, bread -> butter) has a support of 20% since milk, bread and butter occur together in 20% of all transactions (1 out of 5 transactions).
– the rule (milk, bread -> butter) has a confidence of 50% since butter appears in 50% of the transactions that contain milk and bread (1 out of 2 transactions).
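Notes- The support and confidence figures can be checked by hand in base R. This is a sketch on the 5-transaction example only; the variable names are ours, and for real data the arules package does this at scale.

```r
# The 5-transaction example as a 0/1 matrix
tx <- matrix(c(1, 1, 0, 0,
               0, 0, 1, 0,
               0, 0, 0, 1,
               1, 1, 1, 0,
               0, 1, 0, 0),
             ncol = 4, byrow = TRUE,
             dimnames = list(1:5, c("milk", "bread", "butter", "beer")))

lhs  <- tx[, "milk"] == 1 & tx[, "bread"] == 1   # milk and bread bought
both <- lhs & tx[, "butter"] == 1                # ... and butter too

support    <- mean(both)             # 1 of 5 transactions = 0.2
confidence <- sum(both) / sum(lhs)   # 1 of 2 transactions = 0.5
support
confidence
```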
Regression Models
• lm function
• Understanding output
• Diagnostics
– homoskedasticity
– multicollinearity
– p value
– residuals
Advanced Topics: Demos
• Time Series Analysis (use epack plugin) http://decisionstats.com/2010/10/22/doing-time-series-using-a-r-gui/
• Advanced Data Visualization (kmggplot2 plugin) http://decisionstats.com/2012/05/21/new-rcommander-with-ggplot-rstats/
• Social Network Analysis (sna)
– Facebook http://decisionstats.com/2014/05/10/analyzing-facebook-networks-using-rstats/
– Twitter http://www.slideshare.net/ajayohri/twitter-analysis-by-kaify-rais
• Spatial Analysis
– ggmap demo http://decisionstats.com/2013/08/19/the-wonderful-ggmap-package-for-spatial-analysis-in-r-rstats/
– rmaps http://rcharts.io/viewer/?9223554#.Uw4hOPmSySp
Thank You
• http://linkedin.com/in/ajayohri
• [email protected]