A short tutorial on R for data science
Bowei Chen
School of Computer Science
University of Lincoln
2016 - 2017
Preface
This short tutorial gives a practical introduction to R for data science programming. It
is aimed at undergraduate students and practitioners who have no background or experience
in data science or statistics. Note that the tutorial focuses on teaching basic
R data programming skills from scratch rather than data science algorithms.
The tutorial is based on several open-source materials from the R community (see the key
references section for details). It has been used in the workshops of the Data Science
module in the School of Computer Science at the University of Lincoln, UK. The content
takes around 8 hours to study. Thanks to the module demonstrators Deema Abdal Hafeth and
Jingmin Huang, who helped with the exercises.
Table of contents
1. Introduction
2. Data structures
3. Data I/O
4. Graphics
5. Handling and processing strings
6. Linear regression
7. Logistic regression
8. Other topics
What is R?
• R is a free software environment for
statistical computing and graphics.
• R compiles and runs on a wide variety of
UNIX platforms, Windows and MacOS.
• R can be downloaded at:
https://cran.r-project.org/
Comprehensive R Archive Network (CRAN)
• CRAN hosts packages that provide additional functionality.
• Over 7,801 additional packages (as of January 2016) available at CRAN, Bioconductor,
Omegahat, GitHub, and other repositories.
• R packages are written mainly by academics and company staff.
• The R Foundation is seated in Vienna, Austria and currently hosted by the Vienna
University of Economics and Business. It is a registered association under Austrian law
and active worldwide.
Short history of R (1/2)
• S is a statistical programming language developed primarily by John Chambers, Rick
Becker and Allan Wilks at Bell Laboratories since 1976.
• The two modern implementations of S are:
– R: part of the GNU free software project
– S-PLUS (or S+): A commercial product sold by TIBCO Software
Short history of R (2/2)
• S-PLUS is a commercial implementation of the S programming language sold by TIBCO
Software Inc.
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland,
New Zealand, and is currently developed by the R Development Core Team, of which
John Chambers is a member. R is named partly after
the first names of the first two R authors and partly as a play on the name of S.
What can you do using R? (1/2)
• Data entry and manipulation
– Input data
• from keyboard
• from spreadsheet
• from another statistics package
– Manipulate data
• Statistical analysis
– Descriptive statistics
– Statistical inference
What can you do using R? (2/2)
• Graphical display
– Predefined plots for some models
– Flexible, powerful options
– Save to image files in various formats
• Write new functions
– Make a change to an existing function
– Create new functions tailored to your exact needs
– Contribute a new package
• Create documents (with Sweave, knitr)
– PDF (article and slides)
– HTML
Why use R for statistical computing?
• Open source (R is part of the GNU project and an implementation of S)
• Good visualisations (ggplot2, lattice, standard plot library)
• Easier for writing custom packages and functions
• Closer to the statistics and machine learning community
• Better LaTeX support (Sweave, knitr)
• Works with big data (RHadoop, Rspark, Rcpp)
[Figure: O’Reilly 2016 Data Science Salary Survey]
Limitations of R
• The quality of some packages is less than perfect. They are not error-free!
• Many R commands give little thought to memory management, so R can very
quickly consume all available memory. This can be a restriction when doing data
mining. There are various solutions, including using 64-bit operating systems, which can
access much more memory than 32-bit ones.
• Documentation is sometimes patchy and terse, and impenetrable to the non-statistician.
However, some very high-standard books are increasingly plugging the
documentation gaps.
RGui
When R is waiting for us to tell it what to do, it begins the line with >
Type
• 'demo()' for some demos
• 'help()' for on-line help
• 'help.start()' for an HTML browser interface
• 'q()' to quit R
Editors and IDEs
• RStudio
• Jupyter Notebook
• Vim
• Emacs (ESS)
• Eclipse (StatET)
• Tinn-R
• …
R source editor (Ctrl+1)
R console (Ctrl+2)
Environment (Ctrl+8)
History (Ctrl+4)
Help (Ctrl+3)
Files (Ctrl+5)
Plots (Ctrl+6)
Packages (Ctrl+7)
Objects
• Everything in R is an object, and every object has a class.
• Data and intermediate results are stored in R objects.
• The class of an object describes both what the object contains and what many
standard functions do with it.
• Objects are usually accessed by name.
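As a small illustration of the points above, the sketch below asks R for the class of a few objects. The object names (`n`, `s`, `f`) are made up for this example.

```r
# Hypothetical objects, chosen only for illustration
n <- c(1.5, 2, 3)        # a numeric vector
s <- "hello"             # a character string
f <- function(a) a + 1   # a function is an object too

class(n)   # "numeric"
class(s)   # "character"
class(f)   # "function"
```

Standard functions such as `print()` and `summary()` use the class to decide how to handle an object.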
R commands
• R commands are either assignments or expressions
• Commands are separated either by a semicolon ; or newline
Assignment operations
An assignment command evaluates
an expression and passes the value
to a variable, but the result is not
printed.
x <- 1+2
`<-`(x, 1+2) # same thing
x = 1+2 # same thing
Expression operations
An expression command is evaluated
and (normally) printed.
If the statement results in a value, R will
print that value automatically.
> 1+2
[1] 3
> 1+2*3
[1] 7
> (1+2)*3
[1] 9
In R, any number that you print out in the console is interpreted as a vector. A vector is an ordered collection of numbers. The “[1]” means that the index of the first item displayed in the row is 1.
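A minimal sketch of the "a number is a vector" idea: a single value behaves like a vector of length one and can be indexed like any other vector. The names `v` and `w` are made up for this example.

```r
v <- 3              # a single number
is.vector(v)        # TRUE
length(v)           # 1
v[1]                # 3: the first (and only) element

w <- c(10, 20, 30)  # a longer vector
length(w)           # 3
w[2]                # 20
```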
Workspace
• R stores objects in a workspace that is kept in memory.
• When you quit, R asks whether you want to save the workspace.
• The workspace, containing all the objects you worked on, can then be restored the next
time you work with R, along with a history of the commands used.
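The save-and-restore cycle can be tried by hand. The sketch below saves one object to a temporary file and loads it back; the object name `dose` and the temporary file are assumptions made for illustration. `save.image()` does the same for the whole workspace and writes to `.RData` by default.

```r
dose <- c(3, 5, 2)                        # hypothetical object in the workspace
ls()                                      # list the objects currently defined

ws_file <- tempfile(fileext = ".RData")   # temporary file, for illustration only
save(dose, file = ws_file)                # save.image(ws_file) would save everything

rm(dose)                                  # remove the object from memory
gone <- !exists("dose", inherits = FALSE) # check that it is no longer defined here

load(ws_file)                             # restore it from the file
dose
```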
Variables (1/3)
A variable is a symbol that holds a value,
which can be any R object.
The types of variables are:
• Integer
• Double
• Character
• Logical
• Factor or categorical
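The difference between integer and double is easy to miss, because a plain number such as 49 is stored as a double. A minimal sketch using `typeof()` and `class()`, with made-up variable names:

```r
a  <- 49                       # plain numbers are doubles
b  <- 49L                      # the L suffix requests an integer
s  <- "text"
ok <- TRUE
g  <- factor(c("yes", "no"))

typeof(a)    # "double"
typeof(b)    # "integer"
typeof(s)    # "character"
typeof(ok)   # "logical"
class(g)     # "factor"
typeof(g)    # "integer": factors are stored as integer codes
```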
Variables (2/3)
Integer, double (numerical values)
> a = 49
> sqrt(a)
[1] 7
> a <- pi
> print(a)
[1] 3.141593
Character, string, logical
> a = "The dog ate my homework"
> sub("dog","cat",a)
[1] "The cat ate my homework"
> a = (1+1==3)
> a
[1] FALSE
Variables (3/3)
Factor
> a <- factor(c("H", "e", "l", "l", "o"))
> print(a)
[1] H e l l o
Levels: e H l o
> class(a)
[1] "factor"
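To see what a factor stores, the following sketch looks at its levels, its integer codes, and the counts per level. The variable `size` and its values are hypothetical.

```r
size <- factor(c("low", "high", "low", "low"))

levels(size)       # "high" "low": the distinct categories, sorted
nlevels(size)      # 2
as.integer(size)   # 2 1 2 2: the position of each value in levels(size)
table(size)        # number of observations per level
```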
Types of numerical variables (1/2)
When we use numerical objects, in
mathematical terms, variables can be
classified as:
• Scalars
• Vectors
• Matrices
A scalar is a single number
> x <- 5
> Y <- 100
Types of numerical variables (2/2)
A vector is a sequence of numbers
> x <- c(3, 5, 2)
> x
[1] 3 5 2
A matrix is a two-way table of numbers
> x <- matrix(c(2, 3, 4, 5, 6, 7), nrow=3, ncol=2)
> x
[,1] [,2]
[1,] 2 5
[2,] 3 6
[3,] 4 7
Variable names
• You can use simple variable names like x, y, A, and a (note that A and a are different
variable names). You can also use longer names like counter, index1, or
subject_id.
• A variable name can contain digits, but it cannot begin with a digit.
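• Be careful not to reuse the names of built-in functions or symbols for your own variables!
For example, if you define your own function named log, it masks the built-in logarithm
function until you remove it. A non-function variable named log does not stop log(x) from
working, but it makes the code confusing.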
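A quick way to check how name masking behaves is to try it. In the sketch below, a call such as `log(...)` skips objects that are not functions, so only a user-defined function named `log` hides the built-in one. The result names `r1`, `r2` and `r3` are made up for this example.

```r
log <- 5                        # a non-function variable named log
r1 <- log(100, base = 10)       # still calls the built-in function: 2

log <- function(x) "masked"     # a user-defined function named log
r2 <- log(100)                  # "masked": the built-in is hidden
r3 <- base::log(100, base = 10) # the built-in is still reachable via base::

rm(log)                         # remove the user definition again
```

After `rm(log)`, a plain `log(x)` calls the built-in function again.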
Comments
A comment is anything you write in your
program code that is ignored by
the computer.
Comments help others understand your
code. Anything following a “#” character is
a comment in R.
> x <- c(3, 5, 2) ## These are the doses of the new drug formulation.
Arithmetic operators
Addition +
Subtraction -
Multiplication *
Division /
Exponentiation ^ or **
Modulus (x mod y) x %% y (5%%2 is 1)
Integer division x %/% y (5%/%2 is 2)
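The operators in the table can be tried directly in the console. A minimal sketch with arbitrary numbers:

```r
7 + 2      # 9
7 - 2      # 5
7 * 2      # 14
7 / 2      # 3.5
2 ^ 3      # 8
2 ** 3     # 8, same as ^
7 %% 2     # 1, the remainder
7 %/% 2    # 3, the integer part of the division
-7 %% 2    # 1, the result takes the sign of the divisor

# Operators work element by element on vectors
c(1, 2, 3) * 2   # 2 4 6
```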
Comparison operators
Equal ==
Not equal !=
Greater than >
Greater than or equal >=
Less than <
Less than or equal <=
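Comparison operators return logical values and work element by element on vectors. The sketch below also shows a common pitfall with `==` on decimal numbers; using `all.equal()` for that case is a suggestion added here, not something taken from the slides.

```r
3 == 3              # TRUE
3 != 4              # TRUE
c(1, 5, 9) > 4      # FALSE TRUE TRUE
c(1, 5, 9) <= 5     # TRUE TRUE FALSE

# Decimal numbers are stored approximately, so == can surprise you
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: compares with a small tolerance
```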
Logical operators
x and y x & y
x or y x | y
Not x !x
Test if x is TRUE isTRUE(x)
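A short sketch of the logical operators applied to two made-up logical vectors. Note that `isTRUE()` is only TRUE for a single TRUE value, not for a longer vector.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y          # TRUE FALSE FALSE
x | y          # TRUE TRUE TRUE
!x             # FALSE TRUE FALSE
isTRUE(x)      # FALSE, because x has three elements
isTRUE(x[1])   # TRUE
```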
Numeric functions
Absolute value abs(x)
Square root sqrt(x)
ceiling(3.475) is 4 ceiling(x)
floor(3.475) is 3 floor(x)
round(3.475, digits=2) is 3.48 round(x, digits=n)
signif(3.475, digits=2) is 3.5 signif(x, digits=n)
Cosine, sine, tan, … cos(x), sin(x), tan(x)
Natural logarithm log(x)
Common logarithm log10(x)
Exponential of x exp(x)
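The numeric functions listed above can be checked with a few simple inputs. The values below are arbitrary and chosen so that the results are easy to verify by hand.

```r
abs(-3)                       # 3
sqrt(16)                      # 4
ceiling(3.2)                  # 4
floor(3.8)                    # 3
round(3.14159, digits = 2)    # 3.14
signif(123456, digits = 2)    # 120000
cos(0)                        # 1
log(exp(2))                   # 2: log() is the natural logarithm
log10(1000)                   # 3
exp(0)                        # 1
```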
Control structures: if
Syntax:
if (cond) { cmd1 }
> if (TRUE) {
+ "this will be printed if it is TRUE"
+ }
[1] "this will be printed if it is TRUE"
Control structures: if-else
Syntax:
if (cond) { cmd1 } else { cmd2 }
> if(1==0) {
+ print(1)
+ } else {
+ print(2)
+ }
[1] 2
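In practice the condition is usually a comparison on a variable rather than a constant. The sketch below wraps an if-else chain in a small function; the function name `describe_sign` is hypothetical.

```r
# Hypothetical helper: describe the sign of a single number
describe_sign <- function(v) {
  if (v > 0) {
    "positive"
  } else if (v < 0) {
    "negative"
  } else {
    "zero"
  }
}

describe_sign(5)    # "positive"
describe_sign(-2)   # "negative"
describe_sign(0)    # "zero"
```

The condition must be a single TRUE or FALSE value; for whole vectors, `ifelse()` is the usual tool.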
Control structures: ifelse
Syntax:
ifelse(cond, yes, no)
> ifelse(1 == 0,
+ "this will be printed if 1==0",
+ "this will be printed if 1!=0")
[1] "this will be printed if 1!=0"
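Unlike `if`, `ifelse()` works on a whole vector of conditions at once. A minimal sketch with made-up exam scores:

```r
score <- c(35, 72, 50, 90)                    # hypothetical data
result <- ifelse(score >= 50, "pass", "fail") # one answer per element
result   # "fail" "pass" "pass" "pass"
```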
Control structures: for
Syntax:
for (var in seq) { expr }
> x <- c("a", "a", "a", "a", "a")
> for (i in x){
+ print(i)
+ }
[1] "a"
[1] "a"
[1] "a"
[1] "a"
[1] "a"
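A for loop can also run over positions instead of values, which is useful when the result is stored in another vector. The sketch below uses `seq_along()` for the positions; the variable names are made up for this example.

```r
fruits <- c("apple", "kiwi", "plum")   # hypothetical data
lens <- integer(length(fruits))        # pre-allocate the result

for (i in seq_along(fruits)) {
  lens[i] <- nchar(fruits[i])          # number of characters in each name
}
lens    # 5 4 4

# Accumulating a total over a sequence
total <- 0
for (v in 1:5) {
  total <- total + v
}
total   # 15
```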
Control structures: repeat
Syntax:
repeat { expr; if (cond) break }
> i <- 10
> repeat {
+ if (i > 25)
+ break
+ else {
+ print(i); i <- i + 5;
+ }
+ }
[1] 10
[1] 15
[1] 20
[1] 25
Control structures: while
Syntax:
while (cond) { expr }
> i <- 10
> while (i <= 25) {
+ print(i); i <- i + 5
+ }
[1] 10
[1] 15
[1] 20
[1] 25
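The `repeat` and `while` examples produce the same values. The sketch below collects them into vectors so the two loops can be compared directly; the names `out_while` and `out_repeat` are made up for this example.

```r
# while: the condition is tested before each iteration
i <- 10
out_while <- c()
while (i <= 25) {
  out_while <- c(out_while, i)
  i <- i + 5
}

# repeat: runs until break is called
i <- 10
out_repeat <- c()
repeat {
  if (i > 25) break
  out_repeat <- c(out_repeat, i)
  i <- i + 5
}

out_while                         # 10 15 20 25
identical(out_while, out_repeat)  # TRUE
```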
Control structures: switch
Syntax:
switch(expr, ...)
> AA = 'foo'
> switch(AA,
+ foo = {
+ print('foo') # case 'foo'
+ },
+ bar = {
+ print('bar') # case 'bar'
+ },
+ {
+ print('default')
+ })
[1] "foo"
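`switch()` also returns a value, so it can be used to look something up. In the sketch below the unnamed last argument acts as the default; the function `shape_sides` and its cases are hypothetical.

```r
# Hypothetical helper: number of sides for a named shape
shape_sides <- function(shape) {
  switch(shape,
         triangle = 3,
         square = 4,
         NA_real_)   # default when no name matches
}

shape_sides("triangle")   # 3
shape_sides("square")     # 4
shape_sides("circle")     # NA
```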
Installing R and RStudio on your machine
• Download R from https://cran.r-project.org/
• Download RStudio at https://www.rstudio.com/
Exercise 1/10
demo(graphics)
demo(plotmath)
demo(Japanese)
demo(lm.glm)
demo(hclColors)
Exercise 2/10
x<-c(4,2,6)
y<-c(1,0,-1)
length(x)
sum(x)
sum(x^2)
x+y
x*y
x-2
x^2
Exercise 3/10
7:11
seq(2,9)
seq(4,10,by=2)
seq(3,30,length=10)
seq(6,-4,by=-2)
Exercise 4/10
rep(2,4)
rep(c(1,2),4)
rep(c(1,2),c(4,4))
rep(1:4,4)
rep(1:4,rep(3,4))
Exercise 5/10
c(T,T,F,F) & c(T,F,F,T)
!x
x <- seq(-3,3,length=200) > 0
1:3 + c(T,F,T)
intersect(1:10,5:15)
drinks <- factor(c("beer","beer","wine","water"))
Exercise 6/10
x<-c(5,7,9); y<-c(6,3,4); z<-cbind(x,y);
print(z)
c(1, 2, 3, . . . , 19, 20)
x <- c(3,6,8); y <- c(2,5,1);
x[y>1.5]
x <- c(3,6,8); y <- c(2,5,1);
y[x==6]
Exercise 7/10
x <- 1:15
if (sample(x, 1) <= 10) {
print("x is less than or equal to 10")
} else {
print("x is greater than 10")
}
Clean all the variables (the workspace)
rm(list=ls())
Clean one variable
rm(x)
Exercise 8/10
x <- c("apples", "oranges", "bananas", "strawberries")
for (i in x) {
print(i)
}
for (i in 1:4) {
print(x[i])
}
for (i in seq(x)) {
print(x[i])
}
for (i in 1:4) print(x[i])
Exercise 9/10
i <- 1
while (i < 10) {
print(i)
i <- i + 1
}
Exercise 10/10
z <- c("Alec", "Dan", "Rob", "Karthik"); typeof(z)
x <- c(0.5, 0.7)
x <- c(TRUE, FALSE)
x <- c("a", "b", "c", "d", "e")
x <- 9:100
x <- c(1 + (0+0i), 2 + (0+4i))
R data structures
Vectors (1/3)
Vectors are one-dimensional arrays that
can hold numeric data, character data, or
logical data. The combine function c() is
used to form the vector.
Note that all the data in a vector must be
of one data type (numeric, character, or
logical).
> a <-c(1, 2, 5, 3, 6, -2, 4)
> b <-c("one", "two", "three")
> c <-c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
# a is a numeric vector,
# b is a character vector, and
# c is a logical vector
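When values of different types are combined with `c()`, R does not raise an error; it converts them all to one common type. A minimal sketch of this behaviour, with made-up variable names:

```r
mixed <- c(1, "two", TRUE)     # mixing numeric, character and logical
mixed                          # "1" "two" "TRUE": everything became character
class(mixed)                   # "character"

num_log <- c(1, TRUE, FALSE)   # mixing numeric and logical
num_log                        # 1 1 0: TRUE becomes 1, FALSE becomes 0
class(num_log)                 # "numeric"
```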
Vectors (2/3)
Scalars are one-element vectors.
> f <- 3
> x <- TRUE
> y <- 100.01
Vectors (3/3)
You can refer to elements of a vector using
a numeric vector of positions within
brackets.
> a <- c(1, 2, 5, 3, 6, -2, 4)
> a[3]
[1] 5
> a[c(1, 3, 5)]
[1] 1 5 6
> a[2:6]
[1] 2 5 3 6 -2
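Besides positive positions, brackets also accept negative positions (to drop elements) and logical conditions (to keep matching elements). These two forms are additions to the slide, sketched here on the same vector `a`:

```r
a <- c(1, 2, 5, 3, 6, -2, 4)

a[-1]          # 2 5 3 6 -2 4: everything except the first element
a[a > 3]       # 5 6 4: elements that satisfy the condition
which(a > 3)   # 3 5 7: positions of those elements
a[length(a)]   # 4: the last element
```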
Matrices (1/4)
A matrix is a two-dimensional array where each element has the same data type
(numeric, character, or logical). Matrices are created with the matrix() function.
mymatrix <- matrix(vector,
nrow=number_of_rows,
ncol=number_of_columns,
byrow=logical_value,
dimnames=list(char_vector_rownames,
char_vector_colnames)
)
Matrices (2/4)
# Create a matrix from a vector
> vector <-c(1,2,3,4)
> foo <-matrix(vector, nrow=2, ncol=2)
> foo
[,1] [,2]
[1,] 1 3
[2,] 2 4
# Create a 5x4 matrix
> y <- matrix(1:20, nrow=5, ncol=4)
> y
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
Matrices (3/4)
# Create a 2x2 matrix with labels and fill the matrix by rows
> cells <- c(1,26,24,68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = TRUE,
+ dimnames = list(rnames, cnames) )
> mymatrix
   C1 C2
R1  1 26
R2 24 68
# Create a 2x2 matrix with labels and fill the matrix by columns
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = FALSE,
+ dimnames = list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 24
R2 26 68
Matrices (4/4)
You can identify rows, columns, or elements of a matrix, x, by using subscripts and brackets.
• x[i,] refers to the ith row
• x[,j] refers to jth column
• x[i,j] refers to the i,jth element
> x <- matrix(1:10, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x[2,]
[1] 2 4 6 8 10
> x[,2]
[1] 3 4
> x[1,4]
[1] 7
> x[1, c(4,5)]
[1] 7 9
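Selecting a single row or column returns a plain vector, not a matrix. The sketch below shows this on the same matrix, together with `drop = FALSE`, which keeps the matrix shape; the names `col2` and `col2m` are made up for this example.

```r
x <- matrix(1:10, nrow = 2)

dim(x)                          # 2 5: rows, columns
x[2, 3]                         # 6

col2  <- x[, 2]                 # a plain vector: 3 4
col2m <- x[, 2, drop = FALSE]   # still a matrix, with 2 rows and 1 column
dim(col2m)                      # 2 1

dim(t(x))                       # 5 2: t() transposes the matrix
```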
Arrays (1/2)
Matrices are two-dimensional and, like vectors, can contain only one data type. When
there are more than two dimensions, you’ll use arrays.
myarray <- array(vector, dimensions, dimnames)
Arrays (2/2)
> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1
   B1 B2 B3
A1  1  3  5
A2  2  4  6
, , C2
   B1 B2 B3
A1  7  9 11
A2  8 10 12
, , C3
   B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
   B1 B2 B3
A1 19 21 23
A2 20 22 24
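Elements of an array are selected in the same way as matrix elements, with one subscript per dimension. A minimal sketch, rebuilding the array z from this slide:

```r
# Same array as on the slide: 2 rows, 3 columns, 4 layers
z <- array(1:24, c(2, 3, 4),
           dimnames = list(c("A1", "A2"),
                           c("B1", "B2", "B3"),
                           c("C1", "C2", "C3", "C4")))

z[1, 2, 3]            # one element by position: row A1, column B2, layer C3 (15)
z["A2", "B3", "C4"]   # one element by name (24)
z[, , "C2"]           # a whole layer, returned as a 2x3 matrix
dim(z)                # 2 3 4
```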
Data frame (1/4)
A data frame is more general than a matrix in that different columns can contain different
modes of data (numeric, character, etc.). A data frame is created with the data.frame() function.
It’s similar to the datasets you’d typically see in SAS, SPSS, Stata, and Python (pandas).
Each column must have only one data type, but you can put columns of different data
types together to form the data frame. Because data frames are close to what analysts
typically think of as datasets, we’ll use the terms columns and variables interchangeably
when discussing data frames.
mydata <- data.frame(col1, col2, col3,…)
Data frame (2/4)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
Data frame (3/4)
Accessing data frame elements is
straightforward. A column can be accessed
by its name using the $ operator.
> patientdata$patientID
[1] 1 2 3 4
> patientdata$diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
> patientdata$status
[1] Poor Improved Excellent Poor
Levels: Excellent Improved Poor
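The "Levels:" lines above appear because the character columns were converted to factors. Since R 4.0.0, data.frame() keeps character columns as character by default, so the sketch below sets stringsAsFactors = TRUE explicitly to reproduce the slide. It also shows two other common ways of selecting data; the condition age > 30 is only an illustration.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor"),
  stringsAsFactors = TRUE   # the default before R 4.0.0; needed for the factor output
)

patientdata$diabetes        # a factor with levels Type1 and Type2
patientdata[["age"]]        # same column as patientdata$age
patientdata[1:2]            # first two columns, still a data frame
# Rows selected by a condition, columns selected by name
patientdata[patientdata$age > 30, c("patientID", "age")]
```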
Data frame (4/4)
To cross-tabulate diabetes type by status, pass both columns to table():
> table(patientdata$diabetes, patientdata$status)
        Excellent Improved Poor
  Type1         1        0    2
  Type2         0        1    0
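A cross-tabulation can be extended with totals or turned into proportions. The following is a small sketch on the same two variables, using the base functions addmargins() and prop.table(); the object name tab is chosen for this example.

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status   <- c("Poor", "Improved", "Excellent", "Poor")

tab <- table(diabetes, status)   # dimension labels come from the argument names
tab
addmargins(tab)                  # adds row and column totals ("Sum")
prop.table(tab, 1)               # proportions within each row
```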
Some useful functions for data frame (1/7)
The attach() function adds the data frame
to the R search path. When a variable name
is encountered, data frames in the search
path are checked in order to locate the
variable.
> summary(mtcars$mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mtcars$mpg, mtcars$disp)
> plot(mtcars$mpg, mtcars$wt)
> attach(mtcars)
> summary(mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mpg, disp)
> plot(mpg, wt)
> detach(mtcars)
Some useful functions for data frame (2/7)
The detach() function removes the data
frame from the search path. Note that
detach() does nothing to the data frame
itself. The statement is optional but is good
programming practice and should be
included routinely.
> attach(mtcars)
> summary(mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mpg, disp)
> plot(mpg, wt)
> detach(mtcars)
Some useful functions for data frame (3/7)
The limitations with this approach are
evident when more than one object can
have the same name.
Here we already have an object named mpg
in our environment when the mtcars data
frame is attached. In such cases, the
original object takes precedence, which
isn’t what you want. The plot statement
fails because mpg has 3 elements and disp
has 32 elements.
> mpg <- c(25, 36, 47)
> attach(mtcars)
The following object is masked _by_ .GlobalEnv:
mpg
> plot(mpg, wt)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
Some useful functions for data frame (4/7)
An alternative to attach() is the with()
function. In this case, the statements within the
{} braces are evaluated with reference
to the mtcars data frame. You don’t
have to worry about name conflicts
here. If there’s only one statement (for
example, summary(mpg)), the {} braces are optional.
Results are not printed automatically inside the
braces, so summary() is wrapped in print().
> with(mtcars, {
+ print(summary(mpg))
+ plot(mpg, disp)
+ plot(mpg, wt)
+ })
Some useful functions for data frame (5/7)
The limitation of the with() function
is that assignments will only exist
within the function brackets.
> with(mtcars, {
+ stats <- summary(mpg)
+ stats
+ })
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
> stats
Error: object 'stats' not found
Some useful functions for data frame (6/7)
If you need to create objects that will
exist outside of the with() construct,
use the special assignment operator <<-
instead of the standard one <-. It will
save the object to the global
environment outside of the with() call.
> with(mtcars, {
+ nokeepstats <- summary(mpg)
+ keepstats <<- summary(mpg)
+ })
> nokeepstats
Error: object 'nokeepstats' not found
> keepstats
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
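Because with() returns the value of the expression it evaluates, the result can also be captured with an ordinary assignment. This is a sketch of that alternative, not something shown on the slide; it reuses the name keepstats from above.

```r
# with() returns the value of its last expression, so <- is enough here
keepstats <- with(mtcars, summary(mpg))
keepstats
```

This avoids <<-, which can silently overwrite an existing object of the same name in the global environment.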
Some useful functions for data frame (7/7)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Factors (1/3)
Categorical (nominal) and ordered
categorical (ordinal) variables in R are
called factors.
The function factor() stores the
categorical values as a vector of integers
in the range [1... k] (where k is the
number of unique values in the nominal
variable), and an internal vector of
character strings (the original values)
mapped to these integers.
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> diabetes
[1] "Type1" "Type2" "Type1" "Type1"
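The two parts of a factor described above (integer codes plus the character labels they map to) can be inspected directly. A minimal sketch, where f is a name chosen for this example:

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
f <- factor(diabetes)

f               # prints the values and the levels
levels(f)       # the character labels: "Type1" "Type2"
as.integer(f)   # the integer codes: 1 2 1 1
table(f)        # counts per level
```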
Factors (2/3)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, ordered=TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame': 4 obs. of 4 variables:
 $ patientID: num 1 2 3 4
 $ age : num 25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
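By default the levels of a factor are sorted alphabetically, which gives the order Excellent < Improved < Poor in the output above. If the intended order is Poor < Improved < Excellent (an assumption about what the variable means), it can be given explicitly with the levels argument, as in this sketch:

```r
status <- c("Poor", "Improved", "Excellent", "Poor")
status <- factor(status, ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))

status
as.integer(status)     # codes follow the declared order: 1 2 3 1
status < "Excellent"   # comparisons respect the order: TRUE TRUE FALSE TRUE
```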
Factors (3/3)
> summary(patientdata)
patientID age diabetes status
Min. :1.00 Min. :25.00 Type1:3 Excellent:1
1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1
Median :2.50 Median :31.00 Poor :2
Mean :2.50 Mean :34.75
3rd Qu.:3.25 3rd Qu.:38.50
Max. :4.00 Max. :52.00
Lists (1/2)
Lists are the most complex of the R data
types. Basically, a list is an ordered
collection of objects (components). A list
allows you to gather a variety of (possibly
unrelated) objects under one name.
mylist <- list(object1, object2, …)
mylist <- list(name1=object1, name2=object2, …)
Lists (2/2)
> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"
$ages
[1] 25 26 18 39
[[3]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
[[4]]
[1] "one"   "two"   "three"
> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
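Double brackets extract a component itself, while single brackets return a shorter list. The sketch below rebuilds mylist from this slide to show the difference, together with the $ shortcut for named components.

```r
mylist <- list(title = "My First List",
               ages  = c(25, 26, 18, 39),
               matrix(1:10, nrow = 5),
               c("one", "two", "three"))

mylist$ages               # named component via $
mylist[["ages"]][2]       # second element of that component: 26
class(mylist[["ages"]])   # "numeric": the component itself
class(mylist["ages"])     # "list": a list containing one component
names(mylist)             # unnamed components have empty names
```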
Exercise 1/10
# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE
# Check class of my_numeric
class(my_numeric)
# Check class of my_character
class(my_character)
# Check class of my_logical
class(my_logical)
Exercise 2/10
# Vector operations
a) Create a vector 1, 2, 3, ..., 10
b) Get the length of the above vector
c) Get the last three numbers from the vector
d) Sort the numbers in decreasing order
e) Remove the number 9 from the above vector
Exercise 3/10
# Vector operations
a) Create a vector from 1 to 3.1415 with the length of 100
b) Create a vector from -2 to 0.1 with the length of 100
c) Get the sum and inner product of a and b
Exercise 4/10
# Vector operations
a) Create a vector x containing 2, 3, 4, 1
b) Create a vector y containing 1, 1, 3, 7
c) Combine x and y as column vectors
Exercise 5/10
# Vector operations
Use the rep() function to create the following vectors:
a) "0" "x" "0" "x" "0" "x"
b) 1 3 2 1 3 2 1 3 2 1 3 2
c) 1 1 1 2 2 2 3 3 3
Exercise 6/10
# Matrix operations
a) Create a matrix which contains values from 1 to 100 with 5 rows and 20 columns
b) Print out the dimensions of the matrix
c) Find out the 4th column’s sum
d) Find out the sum of row 3 and row 17
e) Assign the following names to the rows:
"A", "B", "C", "D", "E"
Exercise 7/10
# Matrix operations
a) Use the matrix() function to create the following matrix:
TypeA TypeB TypeC
Navarra 190 8 22
Zaragoza 191 4 1.7
Madrid 223 80 2.0
b) Add the following column into the matrix:
TypeD
2.00
3.50
2.75
c) Use the apply() function to calculate the mean of each column of the matrix
Exercise 8/10
# Array operations
Create the following array:
, , 1
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
, , 2
     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18
, , 3
     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27
Exercise 9/10
# Data frame operations
Type df <- iris, then
a) Print out the dimensions of df
b) Find out the sum of the "Sepal.Width" column
c) Rename the column "Species" as "label"
d) Find out how many records have "Petal.Length" larger than 1.41
Exercise 10/10
# List operations
Create the following list and save it to the variable x:
[[1]]
[1] 2 3 5
[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
[[3]]
[1] TRUE FALSE TRUE FALSE FALSE
[[4]]
[1] 3
Table of contents
1. Introduction
2. Data structures
3. Data I/O
4. Graphics
5. Handling and processing strings
6. Linear regression
7. Logistic regression
8. Other topics
Sources of data for R
Entering data from the keyboard
Perhaps the simplest method of data entry
is from the keyboard. The edit() function
in R will invoke a text editor that will allow
you to enter your data manually.
> mydata <- data.frame(age = numeric(0),
+ gender = character(0),
+ weight = numeric(0))
> mydata <- edit(mydata)
Importing data from Excel
There are many R packages that allow you to import data from Excel. For example:
• openxlsx
• XLConnect
• xlsx
• …
Note that most of the advice is for pre-Excel 2007 spreadsheets and not the later .xlsx format.
> install.packages("openxlsx")
> library("openxlsx")
> df <-
+ read.xlsx(
+ "PublicHealthEnglandDataTableDistrict.xlsx",
+ sheet = 1,
+ startRow = 1,
+ colNames = TRUE
+ )
Importing data from a delimited text file (1/2)
You can import data from delimited text files using read.table(), a function that
reads a file in table format and saves it as a data frame.
> mydataframe <- read.table(file, header = logical_value,
+ sep = "delimiter",
+ row.names = "name")
where file is a delimited ASCII file, header is a logical value indicating whether
the first row contains variable names (TRUE or FALSE), sep specifies the delimiter
separating data values, and row.names is an optional parameter specifying one or more
variables to represent row identifiers.
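To try these arguments without an existing file, the following sketch first writes a tiny comma-delimited file to a temporary location and then reads it back. The file contents and the column names id, name, and score are made up for illustration.

```r
# Write a small, hypothetical comma-delimited file to a temporary path
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "a1,Ann,3.5",
             "a2,Bob,4.0"), tmp)

# header = TRUE: first line holds the variable names
# sep = ",":     values are separated by commas
# row.names:     use the id column as row identifiers
df <- read.table(tmp, header = TRUE, sep = ",", row.names = "id")
df
str(df)

unlink(tmp)   # remove the temporary file
```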
Importing data from a delimited text file (2/2)
> file <- paste0(path, '/AMZN.csv')
> df <- read.table(file, header = TRUE, sep = ",")
> head(df)
Date Open High Low Close Volume Adj.Close
1 2016-03-04 581.07 581.40 571.07 575.14 3405100 575.14
2 2016-03-03 577.96 579.87 573.11 577.49 2736700 577.49
3 2016-03-02 581.75 585.00 573.70 580.21 4576900 580.21
4 2016-03-01 556.29 579.25 556.00 579.04 5014400 579.04
5 2016-02-29 554.00 564.81 552.51 552.52 4013400 552.52
6 2016-02-26 560.12 562.50 553.17 555.23 4858200 555.23
Importing data from XML
> # install and load the necessary package
> install.packages("XML")
> library(XML)
> xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
> xmlfile <- xmlTreeParse(xml.url)
> class(xmlfile)
[1] "XMLDocument" "XMLAbstractDocument"
> xmltop = xmlRoot(xmlfile)
> plantcat <- xmlSApply(xmltop, function(x) { xmlSApply(x, xmlValue) } )
> # Finally, get the data in a data-frame and have a look at the first rows and columns
> plantcat_df <- data.frame(t(plantcat),row.names = NULL)
> plantcat_df[1:5,1:4]
Importing data from R package
> library(MASS)
> data()
> data(phones)
> phones
$year
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
$calls
[1] 4.4 4.7 4.7 5.9 6.6 7.3 8.1 8.8 10.6 12.0 13.5 14.9 16.1
[14] 21.2 119.0 124.0 142.0 159.0 182.0 212.0 43.0 24.0 27.0 29.0
Importing data from other sources
• Importing SPSS files into R
• Importing Stata files into R
• Importing SAS files into R
• Importing Minitab files into R
• Importing Matlab files into R
• …
Importing data in RStudio (1/2)
Importing data in RStudio (2/2)
Writing data frame into csv or txt files
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")
write.csv(...)
write.csv2(...)
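As a quick illustration of these functions, the sketch below writes a small data frame to a temporary CSV file and reads it back. The data frame scores and its columns are hypothetical. write.csv() is a wrapper around write.table() that uses a comma separator; write.csv2() uses a semicolon separator and a comma as the decimal mark.

```r
# A small, made-up data frame
scores <- data.frame(ID = 1:3, Score = c(10.5, 20, 30))

tmp <- tempfile(fileext = ".csv")
write.csv(scores, file = tmp, row.names = FALSE)   # omit the row-name column

readLines(tmp)          # inspect the raw text that was written
back <- read.csv(tmp)   # read it back into a data frame
back

unlink(tmp)             # remove the temporary file
```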
Useful functions for working with data objects (1/2)
Number of elements/components        length(object)
Dimensions of an object              dim(object)
Structure of an object               str(object)
Class or type of an object           class(object)
How an object is stored              mode(object)
Names of components in an object     names(object)
Combines objects into a vector       c(object, object, ...)
Combines objects as columns          cbind(object, object, ...)
Combines objects as rows             rbind(object, object, ...)
Prints the object                    object
Useful functions for working with data objects (2/2)
Lists the first part of the object            head(object)
Lists the last part of the object             tail(object)
Lists current objects                         ls()
Deletes one or more objects                   rm(object, object, ...)
Edits object and saves as new object          newobject <- edit(object)
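A short session that exercises several of the functions from these two tables might look as follows. The objects v, m, and r are names chosen for this sketch.

```r
v <- c(3, 1, 2)
length(v)                       # number of elements: 3

m <- cbind(a = v, b = v * 10)   # combine as columns: a 3 x 2 matrix
r <- rbind(v, v)                # combine as rows: a 2 x 3 matrix
dim(m)                          # 3 2
str(m)                          # compact description of the structure
mode(m)                         # "numeric"

names(mtcars)[1:3]              # "mpg" "cyl" "disp"
head(mtcars, 3)                 # first three rows
tail(mtcars, 3)                 # last three rows
```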
The drop= argument
By default, subscripting operations reduce
the dimensions of an array
whenever possible. To avoid that, we can
use the drop=FALSE argument:
> mat <- matrix(1:12, 3, 4, byrow = TRUE)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
> s1 <- mat[1,]; s1
[1] 1 2 3 4
> dim(s1)
NULL
> s2 <- mat[1,,drop=FALSE]; s2
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
> dim(s2)
[1] 1 4
Combined selection
Suppose we want to get all the columns for which the element at the first row is less than 3:
> mat <- matrix(1:12, 3, 4, byrow = TRUE)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> mycols <- mat[1,] < 3; mycols
[1]  TRUE  TRUE FALSE FALSE
> mat[ , mycols, drop=FALSE]
     [,1] [,2]
[1,]    1    2
[2,]    5    6
[3,]    9   10
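Logical conditions can be combined with & and applied to rows and columns at the same time. The conditions in this sketch are arbitrary and only extend the example above.

```r
mat <- matrix(1:12, 3, 4, byrow = TRUE)

# Columns whose first-row element is < 3 AND whose third-row element is even
mycols <- mat[1, ] < 3 & mat[3, ] %% 2 == 0
mycols                               # FALSE TRUE FALSE FALSE
mat[, mycols, drop = FALSE]          # a 3 x 1 matrix (column 2)

# Rows whose last-column element is greater than 4
myrows <- mat[, 4] > 4
mat[myrows, mycols, drop = FALSE]    # a 2 x 1 matrix with the values 6 and 10
```

Without drop = FALSE, both results would collapse to plain vectors because only one column is selected.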
Using SQL statements to manipulate data frames
# install the package
> install.packages("sqldf")
> library(sqldf)
> newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names=TRUE)
> newdf
mpg cyl disp hp drat wt qsec vs am gear carb
Valiant 18.1 6 225.0 105 2.76 3.46 20.2 1 0 3 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.21 19.4 1 0 3 1
Toyota Corona 21.5 4 120.1 97 3.70 2.46 20.0 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.32 18.6 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.94 18.9 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.20 19.5 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.83 19.9 1 1 4 1
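For comparison, the same selection can be written in base R without the sqldf package. This is an equivalent sketch rather than part of the original slide: the logical subscript plays the role of the WHERE clause and order() plays the role of ORDER BY.

```r
# WHERE carb = 1
newdf <- mtcars[mtcars$carb == 1, ]
# ORDER BY mpg (ascending)
newdf <- newdf[order(newdf$mpg), ]
newdf
```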
Exercise 1/8
1) Create a vector x representing the numbers from 1 to 11
2) Save x into the x.RData file
3) Remove the object x from the R workspace
4) Import the x.RData file into R and save it to x.
Hint: search online if you are unsure how to do this.
Exercise 2/8
1) Create the following data frame dfA
ID  Case   Number
1   case1  10
2   case2  20
3   case3  30
2) Save dfA into the dfA.csv file
Exercise 3/8
1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard
2) Import the dataset into R using openxlsx package
3) Show the first and last 20 lines of the dataset, respectively
4) Obtain the column names of the dataset
5) Create a new data frame which has the same column names of the dataset and has
the first and last 20 lines of the dataset
Exercise 4/8
1) Download the file AMZN.csv from Blackboard
2) Import the dataset into R
3) Show the class of all columns/fields
4) Create a new data frame where Open <= 570 and Close >= 550
5) Sort the data frame by High (in decreasing order)
6) Create another data frame where Close >= Open
Exercise 5/8
1) Import data from url "http://www.w3schools.com/xml/plant_catalog.xml"
2) Use the xmlTreeParse function to parse the XML file directly from the web
3) Use the xmlRoot function to access the top node
Exercise 6/8
1) Install the MASS package
2) Find Cars93 dataset
3) Extract all the records for the Volkswagen from the field Manufacturer
4) Order the extracted records by Price (ascending) and save them to a data frame
5) Write the data frame to Cars93FilteredData.csv
Exercise 7/8
1) Use SQL statements to manipulate the data frame as required in Exercise 6/8
2) Write the data frame to Cars93FilteredData.RData
Exercise 8/8
1) Download the files iris1.csv and iris2.csv from Blackboard
2) Import these two files into R
3) Combine these two datasets into one data frame
4) Calculate the mean value of every column
5) What will you do with missing values?
Table of contents
1. Introduction
2. Data structures
3. Data I/O
4. Graphics
5. Handling and processing strings
6. Linear regression
7. Logistic regression
8. Other topics
Exploratory graphs
If you are familiar with statistical graphical representations, please
skip this part
Pie chart
[Figure: pie chart of taxs by state (AL, AR, AZ, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI); each wedge is between 3% and 7%]
Dataset: Cigarette
A pie chart is used to show the
relative frequencies or percentages
of the levels of a categorical variable
with wedges of a pie/circle.
It is very useful when creating a
well-designed document intended
for people who will not read the data
(e.g., management).
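As a minimal sketch of the idea, base R's pie() can draw such a chart. The values below are made-up illustration data, not the Cigarette dataset, and the names taxes, pct and wedge_labels are chosen only for this example.

```r
# Hypothetical values for illustration only (not the Cigarette dataset)
taxes <- c(AL = 20, AR = 30, AZ = 10, CA = 40)

# Relative share of each level, as a rounded percentage
pct <- round(100 * taxes / sum(taxes))

# Label each wedge with its name and percentage
wedge_labels <- paste0(names(taxes), " ", pct, "%")

pie(taxes, labels = wedge_labels, main = "Share of total by state")
```

The wedge sizes are computed by pie() from the raw values; the percentages are only needed for the labels.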
Scatter plot
In a scatter plot, a mark,
usually a dot or small circle,
represents a single data point.
With one mark (point) for every
data point, the distribution
of the data can be seen visually.
Depending on how tightly the
points cluster together, you may
be able to discern a clear trend
in the data.
[Figure: scatter plot of AirPassengers with a linear trend line, y = 31.887x - 62057; x-axis: Date (1949 to 1959); y-axis: Number of air passengers (0 to 700)]
Dataset: AirPassengers
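A scatter plot with a straight trend line can be sketched in base R with plot(), lm() and abline(). This is one possible way to do it, using the AirPassengers series that ships with R; the variable names year, passengers and fit are assumptions made for this example, and the fitted coefficients may differ slightly from the equation on the slide depending on how dates are encoded.

```r
# AirPassengers ships with R as a monthly time series
year <- as.numeric(time(AirPassengers))   # time stamps as decimal years
passengers <- as.numeric(AirPassengers)

# One mark per observation
plot(year, passengers,
     main = "AirPassengers",
     xlab = "Date", ylab = "Number of air passengers")

# Least-squares straight line through the points
fit <- lm(passengers ~ year)
abline(fit, col = "red")

coef(fit)   # intercept and slope of the fitted line
```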
Line plot
A line plot provides an excellent
way to map independent and
dependent variables that are both
quantitative.
The rises and falls of the line
make it easy to see how the
values change.
[Figure: line plot of AirPassengers; x-axis: Date (1949 to 1959); y-axis: Number of air passengers (0 to 700)]
Dataset: AirPassengers
Multiple line plot
Multiple line plots have space-saving
characteristics. Because the data values
are marked by small marks (points) and
not bars, they do not have to be offset
from each other (this only becomes a
problem when data values are very dense).
[Figure: line plot of Stock1, Stock2 and Stock3; x-axis: Day (1 to 96); y-axis: Cumulative percentage (0 to 1)]
Dataset: StockShare
Area chart/graph
An area chart/graph displays
quantitative data graphically.
[Figure: area chart of Stock1, Stock2 and Stock3; x-axis: Day (1 to 96); y-axis: Cumulative percentage (0% to 100%)]
Dataset: StockShare
Bar chart
A bar plot is a chart that shows
grouped data with rectangular
bars with lengths proportional to
the values that they show. The
bars can be plotted vertically or
horizontally.
It is one of the best methods to
summarise categorical data.
[Figure: bar chart of Stock1, Stock2 and Stock3; x-axis: Day (1 to 10); y-axis: Percentage (0 to 0.9)]
Dataset: StockShare
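Grouped bars, drawn vertically or horizontally, can be sketched with base R's barplot(). The matrix below is hypothetical illustration data standing in for the StockShare dataset, which is not reproduced here; the name shares is an assumption made for this example.

```r
# Hypothetical daily shares of three stocks (illustration data only).
# matrix() fills column by column, so each column is one day.
shares <- matrix(
  c(0.2, 0.5, 0.3,
    0.4, 0.4, 0.2,
    0.3, 0.3, 0.4),
  nrow = 3,
  dimnames = list(c("Stock1", "Stock2", "Stock3"), c("1", "2", "3"))
)

# beside = TRUE draws grouped bars; beside = FALSE would stack them
barplot(shares, beside = TRUE,
        xlab = "Day", ylab = "Percentage",
        legend.text = rownames(shares))

# horiz = TRUE plots the same bars horizontally
barplot(shares, beside = TRUE, horiz = TRUE,
        xlab = "Percentage", ylab = "Day")
```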
Histogram
A histogram is a graphical
representation of the distribution of
quantitative data. It is an estimate
of the probability distribution of a
quantitative variable and was first introduced by Karl Pearson.
[Figure: histogram; x-axis: Adjusted closing price (40 to 60); y-axis: Frequency (0 to 25)]
Dataset: MSFT
Histogram with distribution fit
A histogram with a distribution
fit is normally used to show the
empirical distribution of the
variable. Sometimes, we use
the Normal/Gaussian distribution to fit the histogram.
[Figure: histogram with a fitted distribution curve; x-axis: Adjusted closing price (40 to 60); y-axis: Frequency (0 to 25)]
Dataset: MSFT
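One way to overlay a Normal/Gaussian fit on a histogram in base R is sketched below. The prices are simulated for illustration and are not the MSFT data; the names price, m and s are assumptions made for this example.

```r
set.seed(1)
# Simulated prices for illustration only (not the MSFT dataset)
price <- rnorm(200, mean = 50, sd = 3)

# freq = FALSE puts densities, not counts, on the y-axis,
# so the histogram and the density curve share one scale
hist(price, freq = FALSE,
     main = "Histogram with Normal fit", xlab = "Price")

# Overlay a Normal density with the sample mean and standard deviation
m <- mean(price)
s <- sd(price)
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red")
```

Estimating the mean and standard deviation from the sample is the simplest fit; other fitting methods would follow the same plotting pattern.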
Base plotting system in R
Dataset (1/3)
> data(Chem97, package = "mlmRev")
> head(Chem97)
  lea school student score gender age gcsescore   gcsecnt
1   1      1       1     4      F   3     6.625 0.3393157
2   1      1       2    10      F  -3     7.625 1.3393157
3   1      1       3    10      F  -4     7.250 0.9643157
4   1      1       4    10      F  -2     7.500 1.2143157
5   1      1       5     8      F  -1     6.444 0.1583157
6   1      1       6    10      F   4     7.750 1.4643157
Dataset (2/3)
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Dataset (3/3)
> data(EuStockMarkets)
> EuStockMarkets <- data.frame(EuStockMarkets)
> head(EuStockMarkets)
      DAX    SMI    CAC   FTSE
1 1628.75 1678.1 1772.8 2443.6
2 1613.63 1688.5 1750.5 2460.2
3 1606.51 1678.6 1718.0 2448.2
4 1621.04 1684.1 1708.1 2470.4
5 1618.16 1686.6 1723.1 2484.7
6 1610.61 1671.6 1714.3 2466.8
Histogram (1/2)
> hist(Chem97$gcsescore)
Histogram (2/2)
> hist(
+ Chem97$gcsescore,
+ main = "Histogram",
+ xlab = "gcsescore",
+ ylab = "Frequency",
+ col = "green"
+ )
Boxplot (1/2)
> boxplot(Chem97$gcsescore,
+ main = 'title',
+ ylab = 'gcsescore')
Boxplot (2/2)
> boxplot(
+ Chem97$gcsescore,
+ Chem97$age,
+ main = 'title',
+ ylab = 'value',
+ names = c('gcsescore','age')
+ )
Scatter plot (1/3)
> plot(
+ Chem97$gcsescore,
+ Chem97$gcsecnt,
+ main = "title",
+ xlab = "gcsescore",
+ ylab = 'gcsecnt',
+ col = "blue"
+ )
Scatter plot (2/3)
> pairs(iris)
Scatter plot (3/3)
> pairs(
+ iris,
+ pch = 21,
+ bg = c("red", "green3", "blue")[unclass(iris$Species)]
+ )
Line plot (1/3)
> plot(
+ EuStockMarkets$DAX,
+ type = "l",
+ main = 'EuStockMarkets',
+ xlab = 'Day',
+ ylab = 'DAX'
+ )
Line plot (2/3)
> plot(
+ EuStockMarkets$DAX,
+ type = "l", col = 'red',
+ xlab = 'Day', ylab = 'Price'
+ )
> lines(EuStockMarkets$FTSE,
+ type = "l", col = 'blue')
> title("EuStockMarkets", cex.main = 1.1)
> legend(
+ 100, 5500, c("DAX", "FTSE"),
+ col = c('red', 'blue'),
+ text.col = "black",
+ lty = c(1,1), merge = TRUE
+ )
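When several series are overlaid, the first plot() call fixes the axis limits, so a later series can fall outside the plotting region. A hedged variant of the example above computes the y-axis range from both series first and places the legend by keyword. The names stocks and y_range are assumptions made for this sketch.

```r
# Work on a local data frame copy of the built-in series
stocks <- data.frame(EuStockMarkets)

# The first plot() call fixes the axes, so pass a range covering both series
y_range <- range(stocks$DAX, stocks$FTSE)

plot(stocks$DAX, type = "l", col = "red", ylim = y_range,
     xlab = "Day", ylab = "Price", main = "EuStockMarkets")
lines(stocks$FTSE, col = "blue")

# A keyword position avoids hard-coding legend coordinates
legend("topleft", legend = c("DAX", "FTSE"),
       col = c("red", "blue"), lty = 1)
```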
Line plot (3/3)
> plot(
+ EuStockMarkets$DAX,
+ EuStockMarkets$CAC,
+ type = "l",
+ main = 'EuStockMarkets',
+ xlab = 'DAX',
+ ylab = 'CAC'
+ )
Exercise 1/5
1) Create a vector x from a series 1 to 1000
2) Create a vector y from a series 12 to 10002
3) Generate the following scatter plot, with x on the x-axis and y on the y-axis
Exercise 2/5
1) Create a data frame df that contains 3 variables: x, y, z (i.e., 3 columns)
2) Each variable has 500 observations (i.e., 500 rows)
3) x follows a standard normal distribution N(0,1)
4) y follows a continuous uniform distribution U[0,1]
5) z follows a Poisson distribution Poisson(0.5)
6) Generate a pairs plot for x, y, z
Hint: look up the documentation of the pairs function (?pairs)
Exercise 3/5
Plot the following figure, where x is in [0, 2π] and y is sin(x)
Exercise 4/5
1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard
2) Import the dataset into R using the openxlsx package
3) Save the data frame into df
4) Plot the histogram of df (the same as the one on the right)
Hint: a) bin width; b) values on the x-axis
Exercise 5/5
1) Download the file AMZN.csv from Blackboard
2) Import the dataset into R
3) Plot the multiple-line figure shown below
R graphics packages: lattice & ggplot2
Author
lattice was developed and
maintained by Deepayan Sarkar,
Assistant Professor at Indian
Statistical Institute.
http://www.isid.ac.in/~deepayan/
Histogram by wrap
> library(lattice)
> pl <- histogram(~ gcsescore |
+ factor(score), data = Chem97)
> print(pl)
Density by wrap
> pl <- densityplot(
+ ~ gcsescore | factor(score),
+ data = Chem97,
+ plot.points = FALSE,
+ ref = TRUE
+ )
> print(pl)
Density plot by different colour
> pl <- densityplot(
+ ~ gcsescore,
+ data = Chem97,
+ groups = score,
+ plot.points = FALSE,
+ ref = TRUE,
+ auto.key = list(columns = 3)
+ )
> print(pl)
boxplot by wrap (1/2)
> pl <- bwplot(
+ gcsescore ^ 2.34 ~ gender | factor(score),
+ Chem97,
+ varwidth = TRUE,
+ layout = c(6, 1),
+ ylab = "Transformed GCSE score"
+ )
> print(pl)
boxplot by wrap (2/2)
> pl <- bwplot(
+ factor(score) ~ gcsescore | gender,
+ data = Chem97,
+ xlab = "Average GCSE score"
+ )
> print(pl)
There are many other functions in lattice.
The references below will be useful:
• http://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf
• https://www.stat.auckland.ac.nz/~paul/RGraphics/chapter4.pdf
• https://fas-web.sunderland.ac.uk/~cs0her/Statistics/UsingLatticeGraphicsInR.htm
Author
ggplot2 was developed by Hadley
Wickham, Chief Scientist at RStudio, and
an Adjunct Professor of Statistics at the
University of Auckland.
http://hadley.nz/
Histogram by wrap
> library(ggplot2)
> pg <-
+ ggplot(Chem97, aes(gcsescore)) +
+ geom_histogram(binwidth = 0.5) +
+ facet_wrap( ~ score)
> print(pg)
Density plot by wrap
> pg <- ggplot(Chem97, aes(gcsescore)) +
+ stat_density(geom = "path",
+ position = "identity") +
+ facet_wrap(~ score)
> print(pg)
Density plot by different colour
> pg <- ggplot(Chem97, aes(gcsescore)) +
+ stat_density(geom = "path",
+ position = "identity",
+ aes(colour = factor(score)))
> print(pg)
boxplot by wrap (1/2)
> pg <- ggplot(Chem97,
+ aes(factor(gender),
+ gcsescore^2.34)) +
+ geom_boxplot() +
+ facet_grid(~score) +
+ ylab("Transformed GCSE score")
> print(pg)
boxplot by wrap (2/2)
> pg <- ggplot(Chem97,
+ aes(factor(score),
+ gcsescore)) +
+ geom_boxplot() +
+ coord_flip() +
+ ylab("Average GCSE score") +
+ facet_wrap( ~ gender)
> print(pg)
There are many other functions in ggplot2.
The references below will be useful:
• http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf
• http://www.statmethods.net/advgraphs/ggplot2.html
• http://blog.echen.me/2012/01/17/quick-introduction-to-ggplot2/
• http://www.stat.wisc.edu/~larget/stat302/chap2.pdf
Exercise 1/11
Use the lattice or ggplot2 package to draw the figure below
> data(postdoc, package = "latticeExtra")
> pl <- barchart(prop.table(postdoc, margin = 1),
+ xlab = "Proportion",
+ auto.key = list(adj = 1))
> print(pl)
Exercise 2/11
1) Read the dataset PublicHealthEnglandDataTableDistrict.xlsx
2) Plot the following figure using the lattice package
Hint: xyplot
Exercise 3/11
1) Read the dataset PublicHealthEnglandDataTableDistrict.xlsx
2) Plot the following figure using the ggplot2 package
Hint: 1) ggplot; 2) plot points; 3) by wrap
Exercise 4/11
1) Read the dataset Chem97 from the mlmRev package
2) Plot the following figure using the ggplot2 package
Exercise 5/11
1) Read the dataset Chem97 from the mlmRev package
2) Plot the following figure using the ggplot2 package
Exercise 6/11
1) Read the dataset Chem97 from the mlmRev package
2) Plot the following figure using the ggplot2 package
Exercise 7/11
1) Read the dataset Chem97 from the mlmRev package
2) Plot the following figure using the ggplot2 package
Exercise 8/11
1) Read the dataset Chem97 from the mlmRev package
2) Plot the following figure using the ggplot2 package
Exercise 9/11
1) Read the dataset Chem97 from the mlmRev package
2) Plot the following figure using the ggplot2 package
Exercise 10/11
1) Read the dataset Chem97 from the mlmRev package
2) Plot the following figure using the ggplot2 package
Exercise 11/11
1) Read the dataset Chem97 from the mlmRev package
2) Plot the following figure using the lattice package
Table of contents
1. Introduction
2. Data structures
3. Data I/O
4. Graphics
5. Handling and processing strings
6. Linear regression
7. Logistic regression
8. Other topics
Empty string
An empty string can be produced by
consecutive quotation marks: ""
> empty_str = ""
> empty_str
[1] ""
> class(empty_str)
[1] "character"
Vector of empty strings
character() will produce a character
vector with as many empty strings as requested
> # vector with 5 empty strings
> char_vector = character(5)
> char_vector
[1] "" "" "" "" ""
is.character() and as.character()
as.character() and is.character() are
generic methods for creating and testing
for objects of type "character"
> a = "test me"
> b = 8 + 9
> # are 'a' and 'b' characters?
> is.character(a)
[1] TRUE
> is.character(b)
[1] FALSE
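The example above only tests with is.character(). A small sketch of the creating side, as.character(), could look as follows; the name b_chr is an assumption made for this example.

```r
b <- 8 + 9

# Convert the number to a character string
b_chr <- as.character(b)
b_chr
is.character(b_chr)

# Converting back works because the text looks like a number
as.numeric(b_chr)
```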
c() for character vector
As you can tell, the resulting vector from
combining integers (1:5), the number pi,
and some "text" is a vector with all its
elements treated as character strings. In
other words, when we combine mixed
data in vectors, strings will dominate.
> a <- c("x", "y", "c")
> a
[1] "x" "y" "c"
> b <- c(1:5, pi, "text")
> b
[1] "1"
[2] "2"
[3] "3"
[4] "4"
[5] "5"
[6] "3.14159265358979"
[7] "text"
paste()
paste() takes one or more R objects,
converts them to "character", and then it
concatenates (pastes) them to form one or
several character strings
> PI = paste("The life of", pi)
> PI
[1] "The life of 3.14159265358979"
> IloveR = paste("I", "love", "R")
> IloveR
[1] "I love R"
> IloveR = paste0("I", "love", "R")
> IloveR
[1] "IloveR"
> IloveR = paste("I", "love", "R", sep = "-")
> IloveR
[1] "I-love-R"
> paste(1:3, c("!", "?", "+"), sep = "",
+ collapse = "")
[1] "1!2?3+"
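The last call above combines sep and collapse, which are easy to confuse. A short sketch of the difference, with variable names chosen only for this example:

```r
x <- c("a", "b", "c")

# sep joins the arguments element by element: the result has 3 strings
by_element <- paste(x, 1:3, sep = "_")

# collapse then joins those strings into a single string
one_string <- paste(x, 1:3, sep = "_", collapse = ", ")

by_element
one_string
```

Without collapse the result keeps one string per element; with collapse it is always a single string.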
Printing characters
Function     Description
print()      Generic printing
noquote()    Print with no quotes
cat()        Concatenation
format()     Special formats
toString()   Convert to string
sprintf()    C-style formatted printing
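The functions in the table can be tried on a small string. This is a minimal sketch; the names s, padded, joined and msg are assumptions made for this example.

```r
s <- "R is fun"

print(s)        # generic printing, shows the quotes
noquote(s)      # same text without the quotes
cat(s, "\n")    # writes the raw text, no index or quotes

# format() can pad a string to a fixed width
padded <- format("abc", width = 6)

# toString() collapses a vector into one comma-separated string
joined <- toString(1:5)

# sprintf() fills C-style placeholders such as %s and %d
msg <- sprintf("%s has %d characters", s, nchar(s))
```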
Basic string manipulations
Function       Description
nchar()        Number of characters
tolower()      Convert to lower case
toupper()      Convert to upper case
casefold()     Case folding
chartr()       Character translation
abbreviate()   Abbreviation
substring()    Substrings of a character vector
substr()       Substrings of a character vector
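A brief sketch of these base functions applied to one string; the variable names are assumptions made for this example.

```r
s <- "Data Science"

n     <- nchar(s)                    # number of characters
lower <- tolower(s)
upper <- toupper(s)
fold  <- casefold(s, upper = TRUE)   # same result as toupper(s)
swap  <- chartr("ae", "AE", s)       # replace a with A and e with E
first <- substr(s, 1, 4)             # characters 1 to 4
rest  <- substring(s, 6)             # from character 6 to the end
abbr  <- abbreviate(s, minlength = 5)
```

substring() differs from substr() mainly in that its end position is optional and defaults to the end of the string.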
Set operations
Function Description
union() Set union
intersect() Intersection
setdiff() Set difference
setequal() Equal sets
identical() Exact equality
is.element() Is element
sort() Sorting
paste(rep()) Repetition
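A short sketch of the set operations in the table, applied to two illustrative character vectors (set1 and set2 are made-up names and values).

```r
set1 <- c("some", "random", "words")         # illustrative
set2 <- c("some", "many", "none", "few")     # illustrative

u <- union(set1, set2)        # all distinct elements, in order of appearance
i <- intersect(set1, set2)    # elements present in both
d <- setdiff(set1, set2)      # elements of set1 that are not in set2
has_few <- is.element("few", set2)
sorted  <- sort(set2)         # alphabetical order
repeated <- paste(rep("ab", 3), collapse = "")

print(u)
print(i)
print(d)
print(has_few)
print(sorted)
print(repeated)
```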
setequal() vs identical()
> set7 = c("some", "random", "string")
> set8 = c("some", "random", "none", "few")
> set9 = c("string", "some", "random")
> setequal(set7, set8)
[1] FALSE
> setequal(set7, set9)
[1] TRUE
> identical(set7, set7)
[1] TRUE
> identical(set7, set9)
[1] FALSE
stringr package
Thanks to Hadley Wickham, we have the
package stringr that adds more
functionality to the base functions for
handling strings in R.
stringr provides functions for:
1) Basic manipulations
2) Regular expression operations.
http://hadley.nz/
Basic string manipulations in stringr
Function Description Similar to
str_c() string concatenation paste()
str_length() number of characters nchar()
str_sub() extracts substrings substring()
str_dup() duplicates characters
str_trim() removes leading and trailing whitespace
str_pad() pads a string
str_wrap() wraps a string paragraph strwrap()
paste() vs str_c()
> paste("University", "of", "Lincoln")
[1] "University of Lincoln"
> paste("University", "of", "Lincoln", NULL)
[1] "University of Lincoln "
> paste("University", "of", "Lincoln", character(0))
[1] "University of Lincoln "
> library(stringr)
> str_c("University", "of", "Lincoln")
[1] "UniversityofLincoln"
> str_c("University", "of", "Lincoln", NULL)
[1] "UniversityofLincoln"
> str_c("University", "of", "Lincoln", character(0))
character(0)
> # stringr >= 1.0.0 returns a zero-length result when any input has length zero;
> # older versions dropped the empty argument and returned "UniversityofLincoln"
nchar() vs str_length()
> nchar("The life of PI")
[1] 14
> str_length("The life of PI")
[1] 14
>
> text_str = c("one", "two", "three", NA)
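> nchar(text_str)
[1] 3 3 5 NA
> # R versions before 3.3.0 returned 2 for the missing element (the length of "NA");
> # nchar(NA) still returns 2 because the logical NA is coerced to the string "NA"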
> str_length(text_str)
[1] 3 3 5 NA
str_sub()
> hw <- "Hadley Wickham"
> str_sub(hw, 1, 6)
[1] "Hadley"
> str_sub(hw, end = 6)
[1] "Hadley"
> str_sub(hw, 8, 14)
[1] "Wickham"
> str_sub(hw, 8)
[1] "Wickham"
> str_sub(hw, c(1, 8), c(6, 14))
[1] "Hadley" "Wickham"
> str_sub(hw, 1:3)
[1] "Hadley Wickham" "adley Wickham" "dley Wickham"
What is a regular expression?
A regular expression (regex or regexp for short) is a pattern that describes a set of
strings. It is a way for a computer user or programmer to express which pattern a
program should look for in text and what the program should do when a match
is found.
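As a first illustration, the sketch below checks which strings consist of one upper-case letter followed only by lower-case letters. The vector words and the pattern are illustrative assumptions, not part of the slides.

```r
words <- c("Lincoln", "lincoln", "R2D2", "Data")   # illustrative strings

# ^ and $ anchor the pattern to the start and end of the string
pattern <- "^[[:upper:]][[:lower:]]+$"

is_capitalised <- grepl(pattern, words)
print(is_capitalised)  # TRUE FALSE FALSE TRUE
```

The pattern describes a whole family of strings at once, which is what makes regular expressions more powerful than a plain text search.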
Functions of regex in R
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE)
Regular expression functions
Function Description
grep() Find regex matches and return (index or value)
grepl() Find regex matches and return (TRUE & FALSE)
sub() Replace the first match
gsub() Replace all the matches
regexpr() Find regex matches (position of the first match)
gregexpr() Find regex matches (positions of all matches)
regexec() Find regex matches (position of the first match and of its parenthesised sub-expressions)
strsplit() Split strings at regex matches
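The functions in the table can be compared on one small vector. The vector x below is illustrative; regmatches() is a base R helper that extracts the text located by regexpr(), gregexpr() or regexec().

```r
x <- c("apple pie", "banana split", "cherry tart")   # illustrative strings

idx      <- grep("an", x)                 # indices of matching elements
vals     <- grep("an", x, value = TRUE)   # the matching elements themselves
has_e    <- grepl("e", x)                 # one TRUE/FALSE per element
first_a  <- sub("a", "A", x)              # replace the first "a" only
all_a    <- gsub("a", "A", x)             # replace every "a"

first_pos <- as.integer(regexpr("an", x))          # -1 means no match
all_pos   <- as.integer(gregexpr("an", x[2])[[1]]) # all matches in "banana split"

# regexec() also reports the parenthesised groups
m      <- regexec("([[:alpha:]]+) ([[:alpha:]]+)", x[1])
groups <- regmatches(x[1], m)[[1]]

pieces <- strsplit(x[1], " ")[[1]]

print(idx); print(vals); print(has_e)
print(first_a); print(all_a)
print(first_pos); print(all_pos)
print(groups); print(pieces)
```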
Metacharacters in R (1/2)
There are some special characters that have a reserved status; they are known as metacharacters.
The metacharacters in Extended Regular Expressions (EREs) are:
. \ | ( ) [ { ^ $ * + ?
In R, we need to escape them with a double backslash \\ when we want to match them literally in a regex pattern.
Metacharacter Escape in R
. \\.
$ \\$
* \\*
+ \\+
? \\?
| \\|
\ \\\\
^ \\^
[ \\[
] \\]
{ \\{
} \\}
( \\(
) \\)
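The effect of escaping can be sketched with the dot and the backslash. The file names and the path are illustrative. The fixed = TRUE argument (listed in the function signatures earlier) is an alternative to escaping: it treats the whole pattern as literal text.

```r
files <- c("report.csv", "report_csv", "data.csv.bak")   # illustrative names

unescaped <- grepl(".csv", files)                # "." matches any character
escaped   <- grepl("\\.csv", files)              # "\\." matches a literal dot
literal   <- grepl(".csv", files, fixed = TRUE)  # no regex interpretation at all

# a literal backslash needs four backslashes in the R pattern string
fixed_path <- gsub("\\\\", "/", "C:\\Users\\me")

print(unescaped)   # TRUE TRUE TRUE
print(escaped)     # TRUE FALSE TRUE
print(literal)     # TRUE FALSE TRUE
print(fixed_path)  # "C:/Users/me"
```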
Metacharacters in R (2/2)
> money = "$money"
>
> sub(pattern = "$", replacement = "XXXXXX", x = money)
[1] "$moneyXXXXXX"
> money = "$money"
>
> sub(pattern = "\\$", replacement = "XXXXXX", x = money)
[1] "XXXXXXmoney"
Sequences (1/4)
Sequence Description
\\d match a digit character
\\D match a non-digit character
\\s match a space character
\\S match a non-space character
\\w match a word character
\\W match a non-word character
\\b match a word boundary
\\B match a non-(word boundary)
\\h match a horizontal space
\\H match a non-horizontal space
\\v match a vertical space
\\V match a non-vertical space
Sequences (2/4)
> sub("\\d", "_", "the dandelion war 2010")
[1] "the dandelion war _010"
> gsub("\\d", "_", "the dandelion war 2010")
[1] "the dandelion war ____"
>
> sub("\\D", "_", "the dandelion war 2010")
[1] "_he dandelion war 2010"
> gsub("\\D", "_", "the dandelion war 2010")
[1] "__________________2010"
Sequences (3/4)
> # replace space with "_"
> sub("\\s", "_", "the dandelion war 2010")
[1] "the_dandelion war 2010"
> gsub("\\s", "_", "the dandelion war 2010")
[1] "the_dandelion_war_2010"
>
> # replace non-space with "_"
> sub("\\S", "_", "the dandelion war 2010")
[1] "_he dandelion war 2010"
> gsub("\\S", "_", "the dandelion war 2010")
[1] "___ _________ ___ ____"
Sequences (4/4)
> # replace word boundary with "_"
> sub("\\b", "_", "the dandelion war 2010")
[1] "_the dandelion war 2010"
> gsub("\\b", "_", "the dandelion war 2010")
[1] "_t_h_e_ _d_a_n_d_e_l_i_o_n_ _w_a_r_ _2_0_1_0_"
> # replace non-word-boundary with "_"
> sub("\\B", "_", "the dandelion war 2010")
[1] "t_he dandelion war 2010"
> gsub("\\B", "_", "the dandelion war 2010")
[1] "t_he d_an_de_li_on w_ar 2_01_0"
Some regex character classes (1/2)
Class Description
[aeiou] Match any one lower case vowel
[AEIOU] Match any one upper case vowel
[0123456789] Match any digit
[0-9] Match any digit (same as previous class)
[a-z] Match any lower case ASCII letter
[A-Z] Match any upper case ASCII letter
[a-zA-Z0-9] Match any of the above classes
[^aeiou] Match anything other than a lowercase vowel
[^0-9] Match anything other than a digit
Some regex character classes (2/2)
> # some string
> transport = c("car", "bike", "plane", "boat")
> # look for e or i
> grep(pattern = "[ei]", transport, value = TRUE)
[1] "bike" "plane"
>
> # some numeric strings
> numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
> grep(pattern = "[01]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[0-9]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[^0-9]", numerics, value = TRUE)
[1] "17-April" "I-II-III" "R 3.0.1"
POSIX character classes (1/2)
Notation Description
[[:lower:]] Lower-case letters
[[:upper:]] Upper-case letters
[[:alpha:]] Alphabetic characters ([[:lower:]] and [[:upper:]])
[[:digit:]] Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[[:alnum:]] Alphanumeric characters ([[:alpha:]] and [[:digit:]])
[[:blank:]] Blank characters: space and tab
[[:cntrl:]] Control characters
[[:punct:]] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]] Space characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]] Hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]] Printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]] Graphical characters ([[:alnum:]] and [[:punct:]])
POSIX character classes (2/2)
> # la vie (string)
> la_vie = "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you print la_vie
> print(la_vie)
[1] "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you cat la_vie
> cat(la_vie)
La vie en #FFC0CB (rose);
Cest la vie!    tres jolie
>
> # remove blank characters (spaces and tabs)
> gsub(pattern = "[[:blank:]]", replacement = "", la_vie)
[1] "Lavieen#FFC0CB(rose);\nCestlavie!tresjolie"
> # remove punctuation
> gsub(pattern = "[[:punct:]]", replacement = "", la_vie)
[1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"
Quantifiers (1/2)
Notation Description
* The preceding item will be matched zero or more times
+ The preceding item will be matched one or more times
? The preceding item is optional and will be matched at most once
{n} The preceding item is matched exactly n times
{n,} The preceding item is matched n or more times
{n,m} The preceding item is matched at least n times, but not more than m times
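The two range quantifiers {n,} and {n,m} can be sketched on a small vector of strings. The vector is redefined here so the sketch runs on its own.

```r
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

# "c" repeated two or more times between "a" and "b"
two_or_more <- grep("ac{2,}b", strings, value = TRUE)

# "c" repeated at least two and at most three times
two_to_three <- grep("ac{2,3}b", strings, value = TRUE)

print(two_or_more)   # "accb" "acccb" "accccb"
print(two_to_three)  # "accb" "acccb"
```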
Quantifiers (2/2)
> strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
> grep("ac*b", strings, value = TRUE)
[1] "ab" "acb" "accb" "acccb" "accccb"
> grep("ac*b", strings, value = FALSE)
[1] 2 3 4 5 6
> grepl("ac*b", strings)
[1] FALSE TRUE TRUE TRUE TRUE TRUE
> grep("ac+b", strings, value = TRUE)
[1] "acb" "accb" "acccb" "accccb"
> grep("ac?b", strings, value = TRUE)
[1] "ab" "acb"
> grep("ac{2}b", strings, value = TRUE)
[1] "accb"
Regex functions in stringr
Function Description
str_detect() Detect the presence or absence of a pattern in a string
str_extract() Extract first piece of a string that matches a pattern
str_extract_all() Extract all pieces of a string that match a pattern
str_match() Extract first matched group from a string
str_match_all() Extract all matched groups from a string
str_locate() Locate the position of the first occurrence of a pattern in a string
str_locate_all() Locate the position of all occurrences of a pattern in a string
str_replace() Replace first occurrence of a matched pattern in a string
str_replace_all() Replace all occurrences of a matched pattern in a string
str_split() Split up a string into a variable number of pieces
str_split_fixed() Split up a string into a fixed number of pieces
Exercise 1/3
# dollar
sub("\\$", "", "$Peace-Love")
# dot
sub("\\.", "", "Peace.Love")
# plus
sub("\\+", "", "Peace+Love")
# caret
sub("\\^", "", "Peace^Love")
# vertical bar
sub("\\|", "", "Peace|Love")
# opening round bracket
sub("\\(", "", "Peace(Love)")
# closing round bracket
sub("\\)", "", "Peace(Love)")
# opening square bracket
sub("\\[", "", "Peace[Love]")
# closing square bracket
sub("\\]", "", "Peace[Love]")
# opening curly bracket
sub("\\{", "", "Peace{Love}")
Exercise 2/3
# replace word character with "_"
sub("\\w", "_", "the dandelion war 2010")
gsub("\\w", "_", "the dandelion war 2010")
# replace non-word character with "_"
sub("\\W", "_", "the dandelion war 2010")
gsub("\\W", "_", "the dandelion war 2010")
Exercise 3/3
# people names
people = c("rori", "emilia", "matteo", "mehmet", "filipe", "anna", "tyler", "rasmus",
"jacob", "youna", "flora", "adi")
# match "m" at most once
grep(pattern = "m?", people, value = TRUE)
# match "m" exactly once
grep(pattern = "m{1}", people, value = TRUE, perl = FALSE)
# match "m" zero or more times, and "t"
grep(pattern = "m*t", people, value = TRUE)
# match "t" zero or more times, and "m"
grep(pattern = "t*m", people, value = TRUE)
6. Linear regression
Example (1/13)
This is an example of implementing linear regression models in R.
We will use the R dataset Cars93 in the MASS library
> library(MASS)
> df <- Cars93
> dim(df)
[1] 93 27
Use the dim() function to see the size of the data. There are 93
observations and 27 variables in the dataset.
Example (2/13)
> head(df,3)
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain
1 Acura Integra Small 12.9 15.9 18.8 25 31 None Front
2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger Front
3 Audi 90 Compact 25.9 29.1 32.3 20 26 Driver only Front
Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1 4 1.8 140 6300 2890 Yes 13.2 5 177
2 6 3.2 200 5500 2335 Yes 18.0 5 195
3 6 2.8 172 5500 2280 Yes 16.9 5 180
Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1 102 68 37 26.5 11 2705 non-USA Acura Integra
2 115 71 38 30.0 15 3560 non-USA Acura Legend
3 102 67 37 28.0 14 3375 non-USA Audi 90
Use the head() function to look at a few
sample observations of the data. This is an
important step in data analysis!
Example (3/13)
> sapply(df, class)
Manufacturer Model Type Min.Price Price Max.Price
"factor" "factor" "factor" "numeric" "numeric" "numeric"
MPG.city MPG.highway AirBags DriveTrain Cylinders EngineSize
"integer" "integer" "factor" "factor" "factor" "numeric"
Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers
"integer" "integer" "integer" "factor" "numeric" "integer"
Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room
"integer" "integer" "integer" "integer" "numeric" "integer"
Weight Origin Make
"integer" "factor" "factor"
Use sapply() to see the data type of
each variable.
Example (4/13)
> plot(df$Horsepower, df$Price,
+ xlab = "Horsepower",
+ ylab = "Price")
Let’s look at two variables of cars:
horsepower and price. Are they
correlated?
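Example (5/13)
> # Simple linear regression (method 2) -----------------
> x <- df$Horsepower  # predictor (definition assumed; not shown on the slide)
> y <- df$Price       # response (definition assumed; not shown on the slide)
> model <- lm(y ~ x)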
> model$coefficients
(Intercept) x
-1.3987691 0.1453712
> beta0 <- model$coefficients[1]
> beta1 <- model$coefficients[2]
>
> plot(df$Horsepower, df$Price,
+ xlab = "Horsepower",
+ ylab = "Price")
> y_hat_vec <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat_vec, lty = 2, col = 4)
> legend(50,
+ 30,
+ lty = 2,
+ col = 4,
+ "Regression line")
Estimate parameters of a simple linear
regression model by using the R function lm()
Example (6/13)
> residuals_vec <- df$Price - y_hat_vec
> summary(residuals_vec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-16.4100 -2.7920 -0.8208 0.0000 1.8030 31.7500
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-16.413 -2.792 -0.821 1.803 31.753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3988 1.8200 -0.769 0.444
x 0.1454 0.0119 12.218 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6171
F-statistic: 149.3 on 1 and 91 DF, p-value: < 2.2e-16
The residual here means the error $y_i - \hat{y}_i$, the difference between the observed price and the fitted price.
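The residuals and the "Residual standard error" line of the summary can be reproduced by hand. This is a sketch that assumes the same Cars93 data from MASS and refits the model with a formula on the data frame; the names fit, res and rse are illustrative.

```r
library(MASS)  # provides the Cars93 dataset

fit <- lm(Price ~ Horsepower, data = Cars93)

# residual = observed price minus fitted price
res <- Cars93$Price - fitted(fit)

# with an intercept in the model, least-squares residuals average to zero
mean_res <- mean(res)

# residual standard error: sqrt(residual sum of squares / residual df)
rse <- sqrt(sum(res^2) / df.residual(fit))

print(summary(res))
print(df.residual(fit))  # 93 observations minus 2 estimated coefficients
print(rse)               # about 5.977, as in the summary output
```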
Example (7/13)
> summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3988 1.8200 -0.769 0.444
x 0.1454 0.0119 12.218 <2e-16 ***
(the rest of the output is identical to Example (6/13))
Std. Error is the estimated standard deviation of the sampling
distribution of the coefficient estimate under
standard regression assumptions.
You are not required to
understand how standard errors are calculated.
However, if you are interested, please read
Casella’s book, Chapters 11-12.
Example (8/13)
> summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3988 1.8200 -0.769 0.444
x 0.1454 0.0119 12.218 <2e-16 ***
(the rest of the output is identical to Example (6/13))
• t value is the t-statistic for testing
whether the corresponding regression
coefficient is different from 0.
• Pr(>|t|) is the p-value of that test. The null hypothesis is that the
coefficient is zero, so a small p-value is evidence against it.
You are not required to
understand how the t value and p-value are calculated.
However, if you are interested, please read
Casella’s book, Chapters 11-12.
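For readers who are curious, the t value and p-value of the slope can be approximated from the rounded numbers printed in the coefficient table. This is only a sketch: the variable names are illustrative, and because the inputs are rounded the result matches the table only to about two decimals.

```r
estimate  <- 0.1454   # slope estimate, as printed in the table
std_error <- 0.0119   # its standard error, as printed
df_resid  <- 91       # residual degrees of freedom from the summary

# t value: how many standard errors the estimate lies away from zero
t_value <- estimate / std_error

# two-sided p-value from the t distribution with df_resid degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(t_value)  # close to the 12.218 shown in the table
print(p_value)  # far below 0.001, hence the three stars
```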
Example (9/13)
> summary(model)
Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6171
F-statistic: 149.3 on 1 and 91 DF, p-value: < 2.2e-16
(the rest of the output is identical to Example (6/13))
R-squared is a statistical measure of how close
the data are to the fitted regression line. It is also
known as the coefficient of determination,
defined by
$R^2 = \frac{\text{Explained variation}}{\text{Total variation}}$
In general, the higher the R-squared, the better
the model fits your data.
You are not required to
understand how R-squared, multiple R-squared,
adjusted R-squared and their tests are calculated.
However, if you are interested, please read
Casella’s book, Chapters 11-12.
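The definition of R-squared can be checked numerically. The sketch below assumes the Cars93 data from MASS and uses the equivalent form 1 - (residual variation / total variation); the adjusted version uses the usual correction for the number of predictors p. All variable names are illustrative.

```r
library(MASS)  # provides the Cars93 dataset

fit <- lm(Price ~ Horsepower, data = Cars93)
y   <- Cars93$Price

ss_total <- sum((y - mean(y))^2)     # total variation
ss_resid <- sum(residuals(fit)^2)    # unexplained variation

# explained / total is the same as 1 - unexplained / total
r_squared <- 1 - ss_resid / ss_total

# adjusted R-squared penalises the number of predictors p
n <- length(y)
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(r_squared)      # about 0.6213
print(adj_r_squared)  # about 0.6171
```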
Example (10/13)
Prediction
If a new Audi A4 has 175 horsepower, what selling
price does the model predict for it?
> # Prediction ------------------------------------------
>
> x_i <- 175
> y_hat_i <- beta1 * x_i + beta0
>
> plot(df$Horsepower, df$Price,
+ xlab = "Horsepower",
+ ylab = "Price")
> y_hat <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat, lty = 2, col = 4)
> points(x_i, y_hat_i, col = 2, pch=9)
> legend(75,
+ 50,
+ lty = c(2,NA),
+ pch = c(NA,9),
+ col = c(4,2),
+ c("Regression line", "New Audi A4"))
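The same prediction can be obtained with predict() instead of computing beta1 * x_i + beta0 by hand. This sketch assumes the model is refitted with a formula on the Cars93 data frame, so that the new observation can be passed as a data frame with a Horsepower column; fit, new_car and price_hat are illustrative names.

```r
library(MASS)  # provides the Cars93 dataset

fit <- lm(Price ~ Horsepower, data = Cars93)

# the new car must use the same column name as the predictor in the formula
new_car <- data.frame(Horsepower = 175)

price_hat <- unname(predict(fit, newdata = new_car))
print(price_hat)  # about 24.04 (Price is recorded in thousands of US dollars)

# a prediction interval shows how uncertain a single new price is
print(predict(fit, newdata = new_car, interval = "prediction"))
```

Fitting with data = ... also avoids the need for separate x and y vectors.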
Example (11/13)
> attach(df)
> pairs(
+ data.frame(
+ MPG.city,
+ MPG.highway,
+ EngineSize,
+ Horsepower,
+ Fuel.tank.capacity,
+ Length,
+ Width,
+ Rear.seat.room,
+ Luggage.room
+ )
+ )
> detach(df)
Let’s look at the pairwise relationships
among several variables of cars
Example (12/13)
> attach(df)
> model.multiple <-
+ lm(
+ Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room
+ )
> detach(df)
> model.multiple$coefficients
(Intercept) MPG.city MPG.highway EngineSize Horsepower Fuel.tank.capacity Length
59.1474034 0.2363122 -0.3766282 1.8048313 0.1290087 0.6154648 0.1150924
Width Rear.seat.room Luggage.room
-1.3785983 0.1206144 0.2735771
Estimate parameters of a multiple linear
regression model by using the R function lm()
Example (13/13)
> summary(model.multiple)
Call:
lm(formula = Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room)
Residuals:
Min 1Q Median 3Q Max
-11.7444 -3.7098 -0.2932 2.9824 28.7627
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.14740 27.51934 2.149 0.03497 *
MPG.city 0.23631 0.44678 0.529 0.59848
MPG.highway -0.37663 0.44106 -0.854 0.39598
EngineSize 1.80483 1.85233 0.974 0.33314
Horsepower 0.12901 0.02576 5.008 3.78e-06 ***
Fuel.tank.capacity 0.61546 0.50620 1.216 0.22801
Length 0.11509 0.11504 1.000 0.32044
Width -1.37860 0.49336 -2.794 0.00666 **
Rear.seat.room 0.12061 0.33957 0.355 0.72348
Luggage.room 0.27358 0.39166 0.699 0.48711
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.868 on 72 degrees of freedom (11 observations deleted due to missingness)
Multiple R-squared: 0.6914, Adjusted R-squared: 0.6528
F-statistic: 17.92 on 9 and 72 DF, p-value: 3.547e-15
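The multiple regression can also be fitted without attach() and detach() by passing the data frame through the data argument, and the "11 observations deleted due to missingness" note can be checked with complete.cases(). This is a sketch on the Cars93 data from MASS; the names predictors, fml and fit_multi are illustrative.

```r
library(MASS)  # provides the Cars93 dataset

predictors <- c("MPG.city", "MPG.highway", "EngineSize", "Horsepower",
                "Fuel.tank.capacity", "Length", "Width",
                "Rear.seat.room", "Luggage.room")

# build the formula Price ~ MPG.city + ... from the character vector
fml <- reformulate(predictors, response = "Price")
fit_multi <- lm(fml, data = Cars93)

# lm() silently drops rows with a missing value in any model variable
n_used    <- nobs(fit_multi)
n_dropped <- sum(!complete.cases(Cars93[, c("Price", predictors)]))

# which predictors contain the missing values?
na_per_column <- colSums(is.na(Cars93[, predictors]))

print(n_used)
print(n_dropped)
print(na_per_column[na_per_column > 0])
```

Because rows are dropped, this model is fitted on fewer cars than the simple regression, which should be kept in mind when comparing their R-squared values.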
7. Logistic regression
Example (1/13)
This is an example of fitting logistic regression models in R.
We will use the Housing.csv dataset.
> df <- read.csv("C:/Housing.csv")
> dim(df)
[1] 546 12
Use the dim() function to see the size of the data. There are 546
observations and 12 variables in the dataset.
![Page 219: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/219.jpg)
Example (2/13)
> head(df)
price housesize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea
1 420 5850 3 1 2 1 0 1 0 0 1 0
2 385 4000 2 1 1 1 0 0 0 0 0 0
3 495 3060 3 1 1 1 0 0 0 0 0 0
4 605 6650 3 1 2 1 1 0 0 0 0 0
5 610 6360 2 1 1 1 0 0 0 0 0 0
6 660 4160 3 1 1 1 1 1 0 1 0 0
Use the head() function to look at the first few
observations of the data (6 by default).
![Page 220: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/220.jpg)
Example (3/13)
> lapply(df,class)
$price
[1] "numeric"
$housesize
[1] "integer"
$bedrooms
[1] "integer"
$bathrms
[1] "integer"
$stories
[1] "integer"
$driveway
[1] "integer"
…
Use lapply() to check the data type of each variable
(the result is displayed vertically, one variable per entry).
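For a more compact view, the same information can be obtained with sapply() or str(). The sketch below uses a small made-up data frame (toy_df, an assumption for illustration) instead of Housing.csv so that it runs on its own.

```r
# Hypothetical toy data frame standing in for the housing data
toy_df <- data.frame(
  price     = c(420, 385, 495),
  housesize = c(5850L, 4000L, 3060L),
  bedrooms  = c(3L, 2L, 3L)
)

# sapply() simplifies the result to a named character vector
col_types <- sapply(toy_df, class)
print(col_types)

# str() shows the type and the first few values of every column
str(toy_df)
```

sapply() is convenient when every column has a single class; lapply() is the safer choice when some columns may have several classes.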
![Page 221: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/221.jpg)
Example (4/13)
> summary(df)
price housesize bedrooms bathrms stories driveway
Min. : 250.0 Min. : 1650 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.000
1st Qu.: 491.2 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
Median : 620.0 Median : 4600 Median :3.000 Median :1.000 Median :2.000 Median :1.000
Mean : 681.2 Mean : 5150 Mean :2.965 Mean :1.286 Mean :1.808 Mean :0.859
3rd Qu.: 820.0 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.000
Max. :1900.0 Max. :16200 Max. :6.000 Max. :4.000 Max. :4.000 Max. :1.000
recroom fullbase gashw airco garagepl prefarea
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.1777 Mean :0.3498 Mean :0.04579 Mean :0.3168 Mean :0.6923 Mean :0.2344
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :3.0000 Max. :1.0000
Use summary() to produce summary statistics for every variable.
![Page 222: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/222.jpg)
Example (5/13)
> summary(df$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
250.0 491.2 620.0 681.2 820.0 1900.0
Use summary() on a single column to produce the summary statistics for one variable at a time.
![Page 223: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/223.jpg)
Example (6/13)
Let’s create a graph with two subplots, one for each predictor. This helps us
understand the effect of each predictor on the response variable.
> par(mfrow=c(1, 2))
> plot(df$price, df$fullbase,xlab = "Price",
+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,
+ pch = 16, col = "green",cex.lab=1.5, cex.axis=1.5,
+ cex.sub=1.5)
> plot(df$housesize, df$fullbase,xlab = "Housesize",
+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,
+ pch = 16, col = "blue",cex.lab=1.5, cex.axis=1.5,
+ cex.sub=1.5)
![Page 224: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/224.jpg)
Example (7/13)
> model1<-glm(fullbase~price,data=df,family=binomial)
> model1$coefficients
(Intercept) price
-1.622737e+00 1.447098e-03
> plot(df$price, df$fullbase,xlab = "Price",
+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,pch = 16,
+ col = "blue",cex.lab=1.5, cex.axis=1.5, cex.sub=1.5)
> xprice<-seq(min(df$price),max(df$price))
> yprice<-predict(model1,list(price=xprice),type="response")
> lines(xprice,yprice)
Fit a logistic regression model with the
built-in glm() function in R.
Note: the fitted curve may be hard to see because the
price variable covers a wide range of values.
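As a self-contained illustration of the same glm() and predict() calls, the sketch below fits a logistic regression to simulated data. Everything in it (the sim_ names, the sample size, and the "true" coefficients -1.6 and 0.0015) is made up for illustration and is not taken from Housing.csv.

```r
set.seed(1)  # make the simulation reproducible

# Simulated stand-in for the housing data (all values are assumptions)
n <- 500
sim_price <- runif(n, min = 250, max = 1900)
true_prob <- plogis(-1.6 + 0.0015 * sim_price)   # logistic function
sim_fullbase <- rbinom(n, size = 1, prob = true_prob)
sim_df <- data.frame(price = sim_price, fullbase = sim_fullbase)

# Fit the model in the same way as in the slide
sim_model <- glm(fullbase ~ price, data = sim_df, family = binomial)

# Predicted probabilities on an evenly spaced grid of prices
price_grid <- seq(min(sim_df$price), max(sim_df$price), length.out = 200)
fitted_prob <- predict(sim_model,
                       newdata = data.frame(price = price_grid),
                       type = "response")

print(coef(sim_model))
print(range(fitted_prob))
```

With type = "response", predict() returns probabilities between 0 and 1, which is why the fitted curve can be drawn on the same axes as the 0/1 response.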
![Page 225: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/225.jpg)
Example (8/13)
> # Get a clearer plot of the fitted regression curve
> range(df$price)
[1] 250 1900
> plot(df$price, df$fullbase, xlim=c(0,2150),ylim=c(-1,2),
+ xlab = "Price", ylab = "Fullbase", col = "blue",
+ frame.plot=TRUE,cex=1.5,pch = 16,cex.lab=1.5,
+ cex.axis=1.5, cex.sub=1.5)
> xprice<-seq(0,2150)
> yprice<-predict(model1,list(price=xprice),type="response")
> lines(xprice,yprice)
Redraw the plot with wider axis limits (xlim, ylim) so that
the fitted logistic curve is easier to see.
![Page 226: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/226.jpg)
Example (9/13)
> summary(model1)
Call:
glm(formula = fullbase ~ price, family = binomial, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6778 -0.8992 -0.8012 1.3529 1.7316
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***
price 0.0014471 0.0003423 4.228 2.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 688.28 on 544 degrees of freedom
AIC: 692.28
Number of Fisher Scoring iterations: 4
Here we see:
• Whether the response variable and the predictor(s) are
positively or negatively related (the sign of each estimate).
• The 𝑧 value and 𝑝-value belong to the hypothesis
test of whether a coefficient is zero.
The null hypothesis is that the coefficient is
zero. As the 𝑝-value for price is much less than 0.05,
we reject the null hypothesis that 𝛽 = 0.
![Page 227: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/227.jpg)
Example (10/13)
The deviance lines of the summary(model1) output shown in Example (9/13) are:
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 688.28 on 544 degrees of freedom
Deviance is a measure of the goodness of fit of a regression
model (higher numbers indicate a worse fit). The ‘Null
deviance’ shows how well the response variable is
predicted by a model that includes only the intercept.
In R:
model1$null.deviance (the null deviance)
model1$deviance (the residual deviance)
For example, the null deviance is 706.89 on 545 degrees
of freedom. Including the independent variable (price)
decreases the deviance to 688.28 on 544 degrees of
freedom.
The deviance is reduced by 18.61 at the cost
of one degree of freedom.
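The arithmetic behind this comparison can be sketched as follows, using the numbers copied from the summary(model1) output. Comparing the drop in deviance with a chi-squared distribution is an addition here, not something the slides do; it is a common way to judge whether the drop is larger than chance would explain.

```r
# Values copied from the summary(model1) output
null_dev  <- 706.89
null_df   <- 545
resid_dev <- 688.28
resid_df  <- 544

dev_drop <- null_dev - resid_dev   # reduction in deviance
df_used  <- null_df - resid_df     # parameters added (here: price)

# Assumption: judge the drop against a chi-squared distribution
p_value <- pchisq(dev_drop, df = df_used, lower.tail = FALSE)

print(dev_drop)
print(df_used)
print(p_value)
```

A small p-value here agrees with the small p-value reported for price in the coefficient table.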
![Page 228: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/228.jpg)
Example (11/13)
The last two lines of the summary(model1) output shown in Example (9/13) are:
AIC: 692.28
Number of Fisher Scoring iterations: 4
The Akaike Information Criterion (AIC) provides a
method for assessing the quality of a model by
comparing it with related models (the model with the
smallest AIC is the preferred one).
Fisher scoring is a variant of Newton’s
method for solving maximum likelihood
problems numerically.
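For a logistic regression on a 0/1 response such as fullbase, the residual deviance equals minus twice the log-likelihood, so the reported AIC can be reproduced as the residual deviance plus twice the number of estimated coefficients. The helper name aic_from_deviance below is hypothetical; the deviances are copied from the summary outputs of model1 and of model2 (Example 13/13).

```r
# Hypothetical helper: AIC for a binary (0/1) logistic regression
aic_from_deviance <- function(residual_deviance, n_coefficients) {
  residual_deviance + 2 * n_coefficients
}

aic_model1 <- aic_from_deviance(688.28, 2)  # intercept + price
aic_model2 <- aic_from_deviance(686.19, 3)  # intercept + price + housesize

print(c(model1 = aic_model1, model2 = aic_model2))
```

The two values match the AIC lines in the summaries (692.28 and 692.19). They are almost identical, so adding housesize barely changes the quality of the model by this criterion.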
![Page 229: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/229.jpg)
Example (12/13) Prediction
If a new house has a rental price of 385.00 pounds, what is
the probability that fullbase = 1 for this house?
> # Prediction ------------------------------------------
> model1<-glm(fullbase~price,data=df,family=binomial)
> plot(df$price, df$fullbase,xlab = "Price", ylab = "Fullbase",
+ frame.plot=TRUE,cex=1.5,pch = 16, col = "blue",
+ cex.lab=1.5, cex.axis=1.5, cex.sub=1.5)
> xprice<-seq(min(df$price),max(df$price))
> yprice<-predict(model1,list(price=xprice),type="response")
> lines(xprice,yprice)
> newdata <- data.frame(price = 385.00)
> y_hat_i<-predict(model1, newdata, type="response")
> points(newdata$price, y_hat_i, col = 2, pch=20)
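What predict(..., type = "response") computes for this single house can be checked by hand: the linear predictor is passed through the logistic function. The sketch below uses the rounded estimates printed by summary(model1), so the result is approximate (roughly 0.26).

```r
# Estimates copied from the summary(model1) output (rounded)
b0 <- -1.6227365   # (Intercept)
b1 <-  0.0014471   # price

new_price <- 385
linear_predictor <- b0 + b1 * new_price

# Logistic function written out, and the built-in equivalent
prob_manual <- 1 / (1 + exp(-linear_predictor))
prob_plogis <- plogis(linear_predictor)

print(round(prob_manual, 3))
```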
![Page 230: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/230.jpg)
Example (13/13)
> model2 <- glm(fullbase~price+housesize,data=df,family=binomial)
> model2$coefficients
(Intercept) price housesize
-1.466744e+00 1.766831e-03 -7.286285e-05
> summary(model2)
Call:
glm(formula = fullbase ~ price + housesize, family = binomial,
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7777 -0.8973 -0.7971 1.3701 1.7224
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.467e+00 2.784e-01 -5.269 1.37e-07 ***
price 1.767e-03 4.120e-04 4.289 1.80e-05 ***
housesize -7.286e-05 5.108e-05 -1.427 0.154
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 686.19 on 543 degrees of freedom
AIC: 692.19
Number of Fisher Scoring iterations: 4
![Page 231: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/231.jpg)
8. Other topics
![Page 232: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/232.jpg)
Assignment operators: ‘=’ vs. ‘<-’
In R, you can use both ‘=’ and ‘<-’ as assignment operators. So what’s the difference
between them and which one should you use?
![Page 233: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/233.jpg)
What’s the difference?
The main difference between the two assignment operators is scope. It’s easiest to see the
difference with an example:
> mean(x=1:10)
[1] 5.5
> x
Error: object 'x' not found
Here x is bound only as an argument inside the call to mean(), so it doesn’t exist in the user workspace.
> mean(x<-1:10)
[1] 5.5
> x
[1] 1 2 3 4 5 6 7 8 9 10
This time the x variable is created in the user workspace.
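The same behaviour can be checked programmatically with exists(). In this sketch the function total() and the name my_numbers are hypothetical, and it assumes that no object called my_numbers has been defined beforehand.

```r
# Hypothetical helper with one argument called my_numbers
total <- function(my_numbers) sum(my_numbers)

# '=' inside a call only names the argument; nothing is created outside
result_named <- total(my_numbers = 1:10)
found_after_equals <- exists("my_numbers")

# '<-' inside a call assigns in the calling environment
result_arrow <- total(my_numbers <- 1:10)
found_after_arrow <- exists("my_numbers")

print(c(found_after_equals, found_after_arrow))
```

Both calls return the same sum; only the second one leaves a my_numbers object behind.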
![Page 234: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/234.jpg)
When does the assignment take place? (1/2)
In the code above, you may be tempted
to think that we “assign 1:10 to x, then
calculate the mean.” This would be
true for languages such as C, but it isn’t
true in R. Consider the function
below. Notice that the
value of a hasn’t changed!
> a <- 1
> f <- function(a) {
+ return(TRUE)
+ }
> res <- f(a <- a + 1); a
[1] 1
![Page 235: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/235.jpg)
When does the assignment take place? (2/2)
In R, the value of a will only change if
the function actually evaluates the
argument. This can lead to
unpredictable behaviour (the output below depends on the random draw):
> f <- function(a) {
+ if (runif(1) > 0.5)
+ TRUE
+ else
+ a
+ }
> a <- 1
> f(a <- a+1); a
[1] 2
[1] 2
> f(a <- a+1); a
[1] 3
[1] 3
> f(a <- a+1); a
[1] TRUE
[1] 3
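A deterministic version of the same effect may be easier to follow. In this sketch the two functions are hypothetical: one never touches its argument, the other evaluates it explicitly with force().

```r
# Hypothetical functions: one ignores its argument, one forces it
ignores_argument <- function(a) TRUE
uses_argument <- function(a) {
  force(a)  # evaluate the argument now
  TRUE
}

counter <- 1

ignores_argument(counter <- counter + 1)
after_ignored <- counter   # the assignment never ran

uses_argument(counter <- counter + 1)
after_forced <- counter    # the assignment ran once

print(c(after_ignored, after_forced))
```

Because arguments are evaluated lazily, an assignment placed inside a function call only happens if the function body uses that argument.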
![Page 236: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/236.jpg)
Which one should I use? (1/2)
Well, there’s quite a strong following for the “<-” operator:
• The Google R style guide prohibits the use of “=” for assignment.
• Hadley Wickham’s style guide recommends “<-”.
• If you want your code to be compatible with S-plus you should use “<-”
(Note: it seems that S-plus now accepts “=”).
• The general R community recommends using “<-”.
![Page 237: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/237.jpg)
Which one should I use? (2/2)
Some people use the “=” operator for the following reasons:
• Other languages use the “=” operator, e.g., Python, C.
• It’s quicker to type “=” than “<-”.
• They want the declared variable to exist in the current workspace.
• Using “=” avoids misleading expressions like if (x[1]<-2)
![Page 238: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/238.jpg)
Computer representation of numbers (1/2)
> a <- sqrt(2)
> a * a == 2
[1] FALSE
> a * a - 2
[1] 4.440892e-16
> all.equal(a * a, 2)
[1] TRUE
Real numbers are not stored exactly on
computers. They use a binary version of
“scientific” notation, e.g., 1.24 × 10^2.
The function all.equal() compares two
objects using a numeric tolerance of 1.5e-8 (the default). If you want much greater
accuracy than this you will need to
consider error propagation carefully.
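One practical detail when using all.equal() in a condition: when the values differ it returns a text description rather than FALSE, so it is usually wrapped in isTRUE(). A minimal sketch:

```r
a <- sqrt(2)

exact_equal  <- (a * a == 2)                 # exact comparison
approx_equal <- isTRUE(all.equal(a * a, 2))  # comparison with tolerance

# all.equal() returns a description, not FALSE, when values differ
difference_report <- all.equal(1, 1.1)

print(exact_equal)
print(approx_equal)
print(difference_report)

# Smallest relative spacing of double-precision numbers on this machine
print(.Machine$double.eps)
```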
![Page 239: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/239.jpg)
Computer representation of numbers (2/2)
> x<- seq(0,0.5,0.1)
> x
[1] 0.0 0.1 0.2 0.3 0.4 0.5
> y <- c(0,0.1,0.2,0.3,0.4,0.5)
> y
[1] 0.0 0.1 0.2 0.3 0.4 0.5
> x == y
[1] TRUE TRUE TRUE FALSE TRUE TRUE
> for (i in seq_along(x)) {
+ print(all.equal(x[i], y[i]))
+ }
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
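The loop can also be replaced by a vectorised comparison with an explicit tolerance. The tolerance value tol below is an assumption chosen for illustration.

```r
x <- seq(0, 0.5, 0.1)
y <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5)

# Position where the exact comparison fails
mismatch_index <- which(x != y)

# Hypothetical tolerance, chosen for illustration
tol <- 1e-8
close_enough <- abs(x - y) < tol

print(mismatch_index)
print(close_enough)
```

The exact comparison fails only at the fourth element, while the tolerance-based comparison treats all six pairs as equal.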
![Page 240: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/240.jpg)
Assigning a value (1/2)
> x <- c(8, 6, 4)
> x[7] <- 10
> x
[1] 8 6 4 NA NA NA 10
Assigning a value to a nonexistent element
of a vector, matrix, array, or list will
expand that structure to accommodate the
new value.
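A short sketch of the same behaviour for a vector and a list (the object names are made up for illustration): the gap is filled with NA in a vector and with NULL in a list.

```r
# Vector: the gap is filled with NA
x <- c(8, 6, 4)
x[7] <- 10
n_missing <- sum(is.na(x))

# List: the gap is filled with NULL
items <- list("a", "b")
items[[5]] <- "e"

print(length(x))
print(n_missing)
print(length(items))
```

This is convenient, but a mistyped index silently creates a longer object, so it is worth checking length() after such assignments.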
![Page 241: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/241.jpg)
Assigning a value (2/2)
In R, the use of semicolons between statements is optional, and most people don't bother.
If an expression is split across two lines, e.g.,
y <- 2 + 3
+ 5
R treats the first line as a complete statement, i.e. as if you said y <- 2 + 3.
It is better to signal to R that the expression is incomplete, e.g.,
y <- 2 + 3 +
5
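The difference between the two layouts can be verified directly. The variable names y_first and y_second are used here only to keep both results side by side.

```r
# Operator at the start of the second line: two separate expressions
y_first <- 2 + 3
+ 5

# Operator at the end of the first line: one expression, continued
y_second <- 2 + 3 +
  5

print(c(y_first, y_second))
```

In the first layout the line "+ 5" is evaluated on its own and its value is not stored in y_first.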
![Page 242: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/242.jpg)
Debugging with RStudio
Usually, I do not recommend using R for
projects with many dependency files;
instead, calling R from other languages
such as Java/Python/C++ for statistical
analysis would be a better solution.
Debugging with RStudio is easy
(similar to Matlab).
For detailed instructions, see:
https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio
![Page 243: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/243.jpg)
LaTeX
LaTeX is a document preparation system
for high-quality typesetting. It is freely
available for Windows, Mac, and Linux
platforms.
Donald E. Knuth
http://cs.stanford.edu/~uno/
https://latex-project.org/intro.html
![Page 244: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/244.jpg)
Sweave (R + LaTeX)
• Install LaTeX on your PC
• Make sure Sweave is available in RStudio (the Sweave() function ships with R in the utils package)
• Download the SweaveDemo.rnw file
from Blackboard
• Open the file and compile the PDF as
shown on the right!
![Page 245: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/245.jpg)
R markdown
• Download the RMarkdownDemo.rmd file from Blackboard
• Open the file and compile the HTML as shown below!
![Page 246: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/246.jpg)
Other topics in R that are not covered in our lectures
• Rcpp: R and C++ mixed programming
• rJava: R and Java mixed programming
• rPython: R and Python mixed programming
• Creating your own R package
• R for statistical modelling (gbm, etc.)
• R for machine learning (kernlab, RWeka, caret, nnet, etc.)
• R for time series analysis
• …
![Page 247: A short tutorial on R for data science - Bowei ChenA short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017. Preface ... O’Reilly](https://reader035.vdocument.in/reader035/viewer/2022081607/5f0206d97e708231d402366b/html5/thumbnails/247.jpg)
Key references
• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.
• P. Teetor (2011) R Cookbook. O’Reilly.
• J. Adler (2012) R in a Nutshell, 2nd Edition. O’Reilly.