R-Programming –Basics
R Programming Ground Up!
Syed Awase Khirni
Syed Awase earned his PhD from University of Zurich in GIS, supported by EU V Framework Scholarship from SPIRIT
Project (www.geo-spirit.org). He currently provides consulting services through his startup www.territorialprescience.com
and www.sycliq.com
Copyright 2008-2016 Syed Awase Khirni TPRI
R Project
• R – Free Software environment for statistical computing and graphics.
• https://www.r-project.org
• https://cran.r-project.org/mirrors.html
S
• S Language – Developed by John Chambers et al. at Bell Labs
• 1976 -> Internal statistical analysis environment, originally implemented as Fortran libraries
• 1988 -> Rewritten in C; the statistical modelling functions were later documented in Statistical Models in S by Chambers and Hastie
• 1991 -> R created in New Zealand by Ross Ihaka and Robert Gentleman
• 1993 -> First public announcement of R
• 1995 -> Martin Mächler convinced Ross and Robert to use the GNU General Public License (GPL)
• 1996, 1997 -> Public mailing lists created and R Core Group formed (including some people associated with S-PLUS)
• 1998 -> S version 4
• 2000 -> R version 1.0.0 released
• 2015 -> R version 3.1.3 released on March 9, 2015
Design of the R System
• R – Statistical programming language based on the S language developed at Bell Labs.
• Divided into 2 conceptual parts
– Base
– Add-on packages
• Base R system contains
– The base package, which is required to run R and contains the most fundamental functions.
– Other packages contained in the base system include utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4
• Add-on packages are published either by the R Core Group or by third parties
• Syntax similar to S, making it easy for S-PLUS users to switch over
• Semantics are superficially similar to S, but in reality are quite different
• Runs on almost any standard computing platform/OS
R?
• R is an integrated suite of software facilities for data manipulation, calculation and graphical display
• R has
– Effective data handling and storage facility
– A suite of operators for calculations on arrays and matrices
– A large, coherent, integrated collection of tools for data analysis
– Graphical facilities for data analysis and display
– A well developed, simple and effective programming language
R- Drawbacks
• Little built-in support for dynamic or 3-D graphics
• Functionality is based on consumer demand and user contributions
• Web support provided through third party software.
DATA TYPES AND BASIC OPERATIONS IN R
Data Types
• Objects
• Numbers
• Attributes
• Entering Input and Printing
• Vectors, Lists
• Factors
• Missing Values
• Data Frames
• Names
Objects in R
• R has five basic or atomic classes of objects
– Character
– Numeric (real numbers)
– Integer
– Complex
– Logical (TRUE/FALSE)
• The most basic object is a vector
– A vector can only contain objects of the same class
– The one exception is a list, which is represented as a vector but can contain objects of different classes
– Empty vectors can be created with the vector() function
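A minimal sketch of the five atomic classes listed above, plus the list exception. The variable names are illustrative only.

```r
# One literal of each atomic class named on the slide
x_chr <- "hello"   # character
x_num <- 3.14      # numeric (double)
x_int <- 42L       # integer
x_cpl <- 1 + 2i    # complex
x_lgl <- TRUE      # logical

classes <- sapply(list(x_chr, x_num, x_int, x_cpl, x_lgl), class)
print(classes)

# A list may hold elements of different classes
mixed <- list(1L, "a", TRUE)
mixed_classes <- sapply(mixed, class)
print(mixed_classes)
```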
R Studio
install.packages()
• To install additional third-party packages into your R installation, we use install.packages()
– install.packages("XLConnect") installs the XLConnect package
• To load an already installed package, we use library()
– library("packagename")
• To check whether a package is already installed
– "<name of your package>" %in% rownames(installed.packages())
Numbers in R
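• Numbers are treated as numeric objects by default (i.e. double-precision real numbers)
• The suffix L gives an integer
• Example: 1 => numeric object; 1L => explicitly gives an integer
• 1/0 => Inf (infinity); Inf can be used in ordinary calculations, e.g. 1/Inf => 0
• NaN => "not a number", an undefined value (e.g. 0/0); it can also be thought of as a missing value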
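The number rules above can be checked directly at the console. This is a small sketch with illustrative variable names.

```r
a <- 1     # numeric (double) by default
b <- 1L    # the L suffix gives an integer
print(class(a))
print(class(b))

pos_inf   <- 1 / 0     # division by zero gives Inf
zero      <- 1 / Inf   # Inf takes part in ordinary arithmetic
undefined <- 0 / 0     # undefined operation gives NaN
print(c(pos_inf, zero, undefined))
```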
Attributes
• R objects can have attributes
– Names, dimnames
– Dimensions (e.g. matrices, arrays)
– Class
– Length
– Other user-defined attributes/metadata
• Attributes of an object can be accessed using the attributes() function.
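As a rough illustration of attributes(), the sketch below inspects a matrix and attaches one user-defined attribute. The attribute name "source" is a made-up example.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

# Attach a user-defined attribute (hypothetical metadata)
attr(m, "source") <- "example"

attrs <- attributes(m)   # a list holding dim and the custom attribute
print(attrs)
```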
Assignment Operator (<-)
• Values are assigned to symbols using the <- assignment operator.
• The grammar of the language determines whether an expression is complete or not
• The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored
• Typing an object's name at the prompt auto-prints it
• In the printed output, [1] indicates that x is a vector and 123781213412 is the first element
• Ctrl+L clears the console
Vectors in R
• The c() function can be used to create vectors of objects.
• Using the vector() function
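The two ways of creating vectors mentioned above might look like this. The values are arbitrary.

```r
# c() combines values into a vector
nums  <- c(0.5, 1.5, 2.5)   # numeric
flags <- c(TRUE, FALSE)     # logical
words <- c("a", "b", "c")   # character
ints  <- 9:12               # integer sequence

# vector() creates a vector of a given mode and length, filled with defaults
empty <- vector("numeric", length = 5)
print(empty)
```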
Mixing Objects
• When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class.
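A short sketch of implicit coercion when classes are mixed inside c():

```r
y1 <- c(1.7, "a")    # number and string: everything becomes character
y2 <- c(TRUE, 2)     # logical and number: TRUE becomes 1
y3 <- c("a", TRUE)   # string and logical: everything becomes character

print(class(y1))
print(class(y2))
print(class(y3))
```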
Explicit Coercion
• Objects can be explicitly coerced from one class to another using the as.* functions.
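Explicit coercion with the as.* functions could be sketched as follows. A conversion that makes no sense yields NA (with a warning, suppressed here).

```r
x <- 0:6
as_num <- as.numeric(x)
as_lgl <- as.logical(x)     # 0 is FALSE, any other number is TRUE
as_chr <- as.character(x)

# Nonsensical coercion produces NA values
bad <- suppressWarnings(as.numeric(c("a", "b", "c")))
print(bad)
```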
Matrices
• Vectors with a dimension attribute are called matrices. The dimension attribute is itself an integer vector of length 2 (nrow, ncol)
• Matrices are constructed column-wise, so entries can be thought of starting from the upper left corner and running down the columns.
• Matrices can also be created directly from vectors by adding a dimension attribute.
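Both construction routes described above, as a minimal sketch:

```r
# matrix() fills column-wise by default
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

# Adding a dim attribute turns a plain vector into a matrix
v <- 1:10
dim(v) <- c(2, 5)
print(v)
```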
Cbind-ing
• Matrices can be created by Column-binding with cbind() function
Rbind-ing
• Matrices can be created by row-binding using rbind() function.
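Column-binding and row-binding of two vectors might look like this:

```r
x <- 1:3
y <- 10:12

by_col <- cbind(x, y)   # 3 rows, 2 columns named x and y
by_row <- rbind(x, y)   # 2 rows named x and y, 3 columns

print(by_col)
print(by_row)
```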
Lists in R
• Lists are a special type of vector that can contain elements of different classes.
• Lists are a very important data type in R
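A small sketch of a list holding elements of four different classes:

```r
item <- list(1, "a", TRUE, 1 + 4i)
classes <- sapply(item, class)
print(classes)

# Individual elements are extracted with double brackets
second <- item[[2]]
```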
Factors
• Used to represent categorical data. Factors can be unordered or ordered.
• Factors are treated specially by modelling functions like lm() and glm()
• Using factors with labels is better than using integers because factors are self-describing: a variable with values such as “Male” and “Female” says more than one coded 1 and 2.
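As an illustration of the points above, the sketch below builds a factor, counts its levels and sets the level order explicitly. The data values are invented.

```r
answers <- factor(c("yes", "yes", "no", "yes", "no"))
counts <- table(answers)       # frequency of each level
print(counts)

# Levels default to alphabetical order; levels= sets the order explicitly
answers2 <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))

# Underneath, a factor is an integer vector with a levels attribute
codes <- as.integer(answers)
```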
Missing Values
• Many existing, industrial and research datasets contain Missing values.
• These can occur due to various reasons such as manual data entry procedures, equipment errors and incorrect measurements.
• Missing values can also be disguised as outliers or wrong data (i.e. values out of bounds)
• Missing values are denoted by NA, or by NaN for undefined mathematical operations
– is.na() is used to test objects if they are NA
– is.nan() is used to test for NaN
– NA values have a class also, so there are integer NA, character NA, etc.
– A NaN value is also NA but the converse is not true.
• Three types of problems are usually associated with missing values
– Loss of efficiency
– Complications in handling and analyzing the data
– Bias resulting from differences between missing and complete data.
• NA values can be identified using is.na() and is.nan()
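A minimal sketch of identifying missing values with is.na() and is.nan(). It also shows that NaN counts as NA but not the other way round.

```r
x <- c(1, 2, NA, 10, 3)
y <- c(1, 2, NaN, NA, 4)

x_na  <- is.na(x)    # TRUE only at the NA
x_nan <- is.nan(x)   # no NaN present
y_na  <- is.na(y)    # TRUE for both NaN and NA
y_nan <- is.nan(y)   # TRUE only for NaN

# NA values carry a class as well
na_classes <- c(class(NA), class(NA_integer_), class(NA_character_))
```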
Data Frames
• Used to store tabular data (table of values)
– They are represented as a special type of list, where every element of the list has to have the same length.
– Each element of the list can be thought of as a column and the length of each element of the list is the number of rows
• Data frames can store different classes of objects in each column, while matrices must have every element of the same class
• Data frames also have a special attribute called row.names.
• Data frames are usually created by calling read.table() or read.csv()
• Can be converted to a matrix by calling the data.matrix() function
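A data frame can also be built by hand with data.frame(), which makes a convenient sketch of the properties above. The column names and values are invented.

```r
df <- data.frame(
  id     = 1:4,
  passed = c(TRUE, TRUE, FALSE, FALSE),
  name   = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

col_classes <- sapply(df, class)   # each column keeps its own class
print(col_classes)

# data.matrix() converts columns to numbers (TRUE becomes 1)
m <- data.matrix(df[, c("id", "passed")])
print(m)
```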
Names in R
• R Objects can also have names, which is very useful for writing readable code and self-describing objects
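Names on vectors, lists and matrices, as a short sketch with made-up labels:

```r
x <- 1:3
names(x) <- c("first", "second", "third")
print(x)

lst <- list(a = 1, b = 2)   # list elements named at creation

m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2"))   # row names, then column names
print(m)
```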
Subsetting
• Extracting subsets from an existing dataset is called subsetting
– [] always returns an object of the same class as the original; it can select more than one element
– [[]] is used to extract a single element of a list or a data frame.
– $ is used to extract an element of a list or data frame by name; semantics are similar to those of [[]].
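The three operators can be compared on a small vector and a small list. This is a sketch with illustrative names.

```r
x <- c("a", "b", "c", "c", "d", "a")
some    <- x[2:4]        # numeric index, still a character vector
not_a   <- x[x != "a"]   # logical index

lst <- list(foo = 1:4, bar = 0.6)
sub_list <- lst["foo"]     # [ ] keeps the class: a list of length 1
elem     <- lst[["foo"]]   # [[ ]] extracts the element itself
same     <- lst$bar        # $ extracts by name
```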
Subsetting Matrix
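A minimal sketch of matrix subsetting with (row, column) indices. By default a single row or element is returned as a plain vector; drop = FALSE keeps the matrix shape.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

one     <- m[1, 2]                 # single element
row1    <- m[1, ]                  # whole first row, as a vector
col2    <- m[, 2]                  # whole second column, as a vector
one_mat <- m[1, 2, drop = FALSE]   # stays a 1 x 1 matrix
```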
Subsetting List
Subsetting Nested Elements
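Nested elements can be reached either by chaining operators or by passing an index vector to [[ ]]. A small sketch with invented data:

```r
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))

chained   <- x$a[[1]]       # first element of the inner list
recursive <- x[[c(1, 3)]]   # element 3 of element 1
from_b    <- x[[c(2, 1)]]   # element 1 of element 2
```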
Partial Matching
• Partial matching of names is allowed with $ by default, and with [[]] when exact = FALSE is passed
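A short sketch of partial matching on a list with one long name:

```r
x <- list(aardvark = 1:5)

by_dollar  <- x$a                      # $ matches the prefix "a"
by_bracket <- x[["a"]]                 # [[ ]] is exact by default, so NULL
by_inexact <- x[["a", exact = FALSE]]  # partial matching switched on
```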
Remove NA values
• A common task is to remove missing value (NAs) prior to performing any analysis.
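Two common ways to drop missing values, sketched with invented data: a logical mask from is.na() for vectors, and complete.cases() for data frames.

```r
x <- c(1, 2, NA, 4, NA, 5)
bad   <- is.na(x)
clean <- x[!bad]
print(clean)

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
good     <- complete.cases(df)   # TRUE for rows without any NA
df_clean <- df[good, ]
print(df_clean)
```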
Vectorized Operations
• Many operations in R are vectorized making code more efficient, concise and easier to read.
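Vectorized arithmetic and comparison work element by element, with no explicit loop. A minimal sketch:

```r
x <- 1:4
y <- 6:9

sums     <- x + y   # element-wise addition
products <- x * y   # element-wise multiplication
above_2  <- x > 2   # element-wise comparison
```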
Vectorized Matrix Operations
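For matrices, * multiplies element by element while %*% performs true matrix multiplication. A small sketch:

```r
a <- matrix(1:4, nrow = 2, ncol = 2)
b <- matrix(rep(10, 4), nrow = 2, ncol = 2)

elementwise <- a * b     # each entry of a times the matching entry of b
product     <- a %*% b   # rows of a times columns of b

print(elementwise)
print(product)
```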
Reading Data
• R provides some useful functions to read data
– read.table, read.csv: for reading tabular data
– readLines: for reading lines of a text file
– source: for reading in R code files (inverse of dump)
– dget: for reading in R code files (inverse of dput)
– load: for reading in saved workspaces
– unserialize: for reading single R objects in binary form.
Writing Data
• R provides a set of functions to write data into files
– write.table: to write data in table format
– writeLines: to write lines
– dump
– dput
– save
– serialize
Reading data files with read.table
• For small to moderately sized datasets, we can just call read.table without specifying any other arguments.
• data <- read.table("sampledata.txt")
R-DataSets
• https://vincentarelbundock.github.io/Rdatasets/datasets.html
• http://openflights.org/data.html
• http://www.public.iastate.edu/~hofmann/data_in_r_sortable.html
• https://r-dir.com/reference/datasets.html
• http://fimi.ua.ac.be/data/
• https://datamarket.com/data/list/?q=provider:tsdl
• https://www.data.gov/
Set/get working directory
• Setting and getting the current working directory
> setwd("<path to your folder>")
> getwd()
Reading CSV files
Airmile data
Mocking sample data with mockaroo
• https://www.mockaroo.com/
Reading large datasets with read.table
write.csv()
• One of the easiest ways to save an R data frame is to write it to a csv file or tsv file or text file.
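A round trip through a CSV file might be sketched as below. The data frame is invented, and a temporary file stands in for a real path.

```r
df <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

path <- tempfile(fileext = ".csv")       # placeholder for a real file path
write.csv(df, path, row.names = FALSE)   # row.names = FALSE omits the index column

back <- read.csv(path)                   # read the same file back in
unlink(path)                             # remove the temporary file
print(back)
```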
dput()
• Writes an ASCII text representation of an R object to a file or connection, or uses one to recreate the object
Head and Tail of DataSet
• head() and tail() return the first or the last part of an object, e.g. a vector, matrix, table, data frame or function.
Loading “foreign” data
• Sometimes, we would like to import data from other statistical packages like SAS,SPSS and Stata
• Reading Stata (.dta) files with read.dta() from the foreign package
• Writing data files from R into Stata is also very straightforward.
Loading “foreign” data
• SPSS data
– Data files in SPSS format can be opened with the function read.spss() from the “foreign” package.
– Set the to.data.frame option to TRUE to return a data frame.
Loading “foreign” data
• Excel data
– Sometimes, we have data in xls format that needs to be imported into R prior to its use.
– library(gdata) provides read.xls() for this
Loading “foreign” data
• Using XLConnect package
• install.packages("XLConnect")
Loading “foreign” data
• Minitab
– Minitab portable worksheets can be imported into R with read.mtp() from the foreign package.
Computing Memory Requirements
• A numeric (double-precision) value takes 8 bytes.
• Imagine you have a data frame with 100,000 rows and 100 numeric columns.
• 100,000 x 100 x 8 bytes/numeric = 80,000,000 bytes
– 2^20 bytes/MB
– 80,000,000 / 2^20 ≈ 76.3 MB of memory is required.
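The same back-of-the-envelope estimate can be done in R. This sketch assumes every column is numeric (8 bytes per value) and ignores the small overhead of the data frame itself.

```r
rows <- 100000
cols <- 100
bytes_per_numeric <- 8

total_bytes <- rows * cols * bytes_per_numeric
total_mb    <- total_bytes / 2^20   # 2^20 bytes per MB

print(total_bytes)
print(round(total_mb, 1))
```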
Text Formats
• dump and dput are useful because the resulting textual format is editable and in the case of corruption, potentially recoverable
• Unlike writing out a table or CSV file, dump and dput preserve the metadata (sacrificing some readability), so that another user doesn’t have to specify it all over again.
• Textual formats work much better with version control programs like Git and SVN, which can then track changes meaningfully
• Text formats have a longer life and adhere to the “Unix philosophy”
• However, the format is not very space-efficient.
dump() function
• Creates a file in a format that can be read with the source() function or pasted in with the copy/paste edit functions of the windowing system.
dput() function
• dput() saves data as an R expression, which means that the resulting text can be copied and pasted into the R console.
• It writes an ASCII text representation of the object to a file or connection; dget() reads that representation back to recreate the object.
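A rough sketch of both text formats: dput() paired with dget() for a single object, and dump() paired with source() for named objects. Temporary files stand in for real paths, and the dumped object is sourced into a separate environment purely so the example does not overwrite anything.

```r
# dput / dget: one object, no name stored
y <- data.frame(a = 1, b = "a")
dput_path <- tempfile(fileext = ".R")
dput(y, file = dput_path)
y_back <- dget(dput_path)

# dump / source: objects are stored together with their names
x <- "foo"
dump_path <- tempfile(fileext = ".R")
dump("x", file = dump_path, envir = environment())

restored <- new.env()
source(dump_path, local = restored)   # recreates x inside 'restored'
print(restored$x)

unlink(c(dput_path, dump_path))
```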
Functions in R
• Functions are a fundamental building block of R
– Functions can be assigned to variables
– Functions can be stored in lists
– Functions can be passed as arguments to other functions
– Functions can be nested, i.e. defined inside other functions
• Anonymous functions are functions that have no name.
• We use functions to package sets of instructions that we want to use repeatedly or that, because of their complexity, are better self-contained in a sub-program and called when needed.
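Each property listed above can be shown in a line or two. All function names here are invented for the sketch.

```r
square <- function(x) x^2                          # assigned to a variable

ops <- list(sq = square, neg = function(x) -x)     # stored in a list

apply_twice <- function(f, x) f(f(x))              # function passed as an argument

make_adder <- function(n) {                        # nested function
  function(x) x + n
}
add5 <- make_adder(5)

tens <- sapply(1:3, function(x) x * 10)            # anonymous function
```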
User Defined Functions in R
• User-defined functions (UDFs) are written to accomplish a particular task, typically when no dedicated function or library exists already or the author is not aware of one.
Infix Operators in R
• They are unique functions and methods that facilitate basic data expressions or transformations.
• They refer to the placement of the arithmetic operator between variables.
• The types of infix operators used in R include operators for data extraction, arithmetic, sequences, comparison, logical tests, variable assignment and custom (user-defined) functions
Infix Operator in R
• Infix operators are used between operands; behind the scenes each one is a function call, e.g. 5 + 3 is evaluated as `+`(5, 3).
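The function-call view of infix operators can be checked by quoting the operator with backticks. A minimal sketch:

```r
infix_sum  <- 5 + 3
prefix_sum <- `+`(5, 3)              # the same call in prefix form

is_member <- `%in%`(2, c(1, 2, 3))   # %in% is an ordinary function too

# Because operators are functions, they can be passed to other functions
doubled <- sapply(1:3, `*`, 2)
```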
Predefined infix Operators in R
Operator   Rank   Description
%%         6      Remainder (modulo) operator
%/%        6      Integer division
%*%        6      Matrix multiplication
%o%        6      Outer product
%x%        6      Kronecker product
%in%       9      Matching operator
::         1      Extract an exported function from a package namespace
:::        1      Extract a hidden (non-exported) function from a namespace
$          2      Extract a list element by name
@          2      Extract a slot of an S4 object
[[]]       3      Extract data by index
Predefined infix operators in R
Operator   Rank   Description
^          4      Exponentiation operator
:          5      Generate a sequence of numbers
!          8      Logical NOT (negation) operator
xor        10     Exclusive OR, called as the function xor(a, b)
&          10     Element-wise logical AND
&&         10     Logical AND for control flow (single values, short-circuit)
~          11     Tilde, used in formulas and model building
<<-        12     Superassignment (assigns in an enclosing environment)
<-         13     Left assignment
->         13     Right assignment
User Defined infix in R
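A user-defined infix operator is a function whose name starts and ends with %. The operator %+% below is a made-up example that concatenates strings.

```r
# Define a custom infix operator (hypothetical name) for string concatenation
`%+%` <- function(lhs, rhs) paste0(lhs, rhs)

joined <- "R" %+% "lang"
print(joined)
```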
CONTROL FLOW IN R
if and if..else
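A minimal sketch of if, else if and else, wrapped in a helper function whose name is invented for the example:

```r
classify <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

print(classify(5))
```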
ifelse()
• Vectors form the basic building block of R programming.
• Most functions in R take a vector as input and return a vector
• Vectorized code is much faster than applying the same function to each element of the vector individually.
• ifelse() is the vectorized equivalent of the if..else statement: ifelse(test_expression, yes, no)
• test_expression must be a logical vector (or an object that can be coerced to logical)
• The return value is a vector with the same length as test_expression
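As a small sketch, ifelse() labels every element of a vector in one call:

```r
x <- c(5, -2, 0, 8)
labels <- ifelse(x > 0, "positive", "non-positive")
print(labels)
```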
for loop
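Two common for loop shapes, sketched with arbitrary data: looping over values directly, and looping over indices with seq_along().

```r
total <- 0
for (v in 1:5) {
  total <- total + v
}

x <- c("a", "b", "c")
out <- character(0)
for (i in seq_along(x)) {
  out <- c(out, paste(i, x[i]))
}
print(out)
```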
while loop
break and next
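The sketch below shows a plain while loop, then a for loop in which next skips odd numbers and break stops the loop once a value exceeds 6.

```r
count <- 0
while (count < 3) {
  count <- count + 1
}

picked <- integer(0)
for (i in 1:10) {
  if (i %% 2 == 1) next   # skip odd numbers
  if (i > 6) break        # leave the loop entirely
  picked <- c(picked, i)
}
print(picked)
```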
Repeat Loop
• A repeat loop is used to iterate over a block of code multiple times
• There is no condition check in repeat loop to exit the loop
• We must put a condition explicitly inside the body of the loop and use the break statement to exit the loop
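A minimal repeat loop with an explicit exit condition. It doubles a value until the value exceeds 100.

```r
x <- 1
steps <- 0
repeat {
  x <- x * 2
  steps <- steps + 1
  if (x > 100) break   # the only way out of a repeat loop
}
print(c(x, steps))
```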
OBJECTS AND CLASSES IN R
OOP in R
• An object is a data structure having some attributes and methods which act on the attributes
• A class is a blueprint for the object.
• R has three(3) class systems
– S3 Class System
– S4 Class System
– Reference Class System
S3 Class System
• Primitive in nature
• Lacks a formal definition; an object of this class is created simply by setting the class attribute
• Attributes are accessed using $
• Methods belong to generic functions
• Follows copy-on-modify semantics
S4 Class System
• A formally defined structure, which helps make objects of the same class look more or less alike.
• Class components are properly defined using the setClass() function and objects are created using the new() function.
• Attributes are accessed using @
• Methods belong to generic function
• Follows copy-on-modify semantics
Reference Class System
• Similar to the object oriented programming we are used to in C# and Java.
• Basically an extension of S4 class system with an environment added to it.
• Reference Class System
– Class defined using setRefClass()
– Objects are created using generator functions
– Attributes are accessed using $
– Methods belong to the class
– Does not follow copy-on-modify semantics
S3 Class System
S3 Class
S3 Class Method
S3 class with methods
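A sketch of an S3 class with a constructor and a print method. The class name "student" and its fields are invented for illustration. The method is found because its name follows the generic.class pattern.

```r
# Constructor: a list with a class attribute
new_student <- function(name, age) {
  obj <- list(name = name, age = age)
  class(obj) <- "student"
  obj
}

# Method for the existing generic print()
print.student <- function(x, ...) {
  cat("Student:", x$name, "aged", x$age, "\n")
  invisible(x)
}

s <- new_student("Asha", 21)
print(s)   # dispatches to print.student
```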
Inheritance – S3 Class System
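S3 inheritance is expressed by giving an object a class vector, most specific class first; NextMethod() then calls the parent's method. The generic describe and both class names below are hypothetical.

```r
describe <- function(x, ...) UseMethod("describe")

describe.person <- function(x, ...) {
  paste("Person:", x$name)
}

describe.employee <- function(x, ...) {
  base <- NextMethod()   # run the method of the parent class "person"
  paste(base, "works at", x$company)
}

p <- structure(list(name = "Ravi"), class = "person")
e <- structure(list(name = "Meera", company = "Acme"),
               class = c("employee", "person"))

print(describe(p))
print(describe(e))
```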
S4 Class System in R
• S4 class is defined using the setClass() function
• Member variables are called slots
• When defining a class, we need to set the name and the slots (along with class of the slot)
S4 Class System in R
Accessing Slots
• Slots of an object are accessed using @
Modifying Slots
• A slot can be modified through reassignment, with @ (or slot()) on the left-hand side of <-
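Defining an S4 class, creating an object, and reading and modifying its slots might be sketched as follows. The class "Student" and its slots are made up for the example.

```r
library(methods)

# Define the class with typed slots
setClass("Student", slots = c(name = "character", age = "numeric"))

# Create an object with new()
s <- new("Student", name = "Asha", age = 21)

# Access slots with @
original_age <- s@age

# Modify slots by reassignment
s@age <- 22
slot(s, "name") <- "Asha K"
```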
Inheritance in S4
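In S4, inheritance is declared with the contains argument of setClass(). In this sketch (class and generic names are hypothetical) a method defined for the parent class is also used for the child.

```r
library(methods)

setClass("Person", slots = c(name = "character"))
setClass("Employee", contains = "Person", slots = c(company = "character"))

setGeneric("greet", function(obj) standardGeneric("greet"))
setMethod("greet", "Person", function(obj) paste("Hello,", obj@name))

e <- new("Employee", name = "Meera", company = "Acme")
greeting <- greet(e)   # Employee inherits the Person method
print(greeting)
```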
R Reference Class System
• Reference classes in R are similar to the object-oriented programming we are used to seeing in C++, Java and Python.
• Unlike S3 and S4 classes, methods belong to the class rather than to generic functions.
• Reference classes are internally implemented as S4 classes with an environment added.
• setRefClass() returns a generator object, which is used to create objects of that class
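A reference class with two fields and one method could look like the sketch below. The class "Account" is invented. Inside a method, fields are modified with <<-.

```r
library(methods)

Account <- setRefClass(
  "Account",
  fields = list(owner = "character", balance = "numeric"),
  methods = list(
    deposit = function(amount) {
      balance <<- balance + amount   # modifies the object in place
      invisible(.self)
    }
  )
)

acc <- Account$new(owner = "Asha", balance = 100)
acc$deposit(50)            # method call through $
acc$owner <- "Asha K"      # field modified by reassignment
print(acc$balance)
```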
Reference Class in R
Accessing Fields in R
• Fields of the object can be accessed using the $ operator
Modifying Fields in R
• Fields can be modified by reassignment
Reference Methods .copy()
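Because reference objects are not copied on assignment, a plain assignment creates a second name for the same object. The built-in copy() method makes an independent copy. The class "Counter" is a made-up example.

```r
library(methods)

Counter <- setRefClass("Counter", fields = list(count = "numeric"))

a <- Counter$new(count = 1)

alias <- a          # same object, second name
alias$count <- 5    # also changes a

snapshot <- a$copy()   # independent copy
snapshot$count <- 99   # does not affect a

print(a$count)
```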
Reference Methods
Inheritance in Reference Class
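Reference classes inherit through the contains argument, and an overriding method can reach the parent's version with callSuper(). Class and method names below are hypothetical.

```r
library(methods)

Person <- setRefClass(
  "Person",
  fields = list(name = "character"),
  methods = list(
    describe = function() paste("Person:", name)
  )
)

Employee <- setRefClass(
  "Employee",
  contains = "Person",
  fields = list(company = "character"),
  methods = list(
    describe = function() {
      base <- callSuper()   # run Person's describe()
      paste(base, "works at", company)
    }
  )
)

e <- Employee$new(name = "Meera", company = "Acme")
description <- e$describe()
print(description)
```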