R programming language

Alberto Minetti

What is R?
• Functional programming language
• Matrix-based
• Interpreted (written in C and Fortran)
• Environment for statistical computing and graphics
• Open source, released under the GPL license
• 6000+ packages on CRAN

Why use R?
• Matrix calculation
• Data visualization (interactive too)
• Statistical analysis (regression, time series, geo-spatial)
• Data mining, classification, clustering
• Analysis of genomic data
• Machine learning

Who uses R?
• Oracle integrates R in its Big Data Appliance
• IBM offers support for in-Hadoop execution of R
• Data analysts at Google and Apple
• 12th in the TIOBE popularity index

How to use R?
• Command-line interface, standalone scripts or graphical front-ends
• Connection to any data source
• Data analysis
• Modeling and computation
• Data visualization
• Fitting models or displaying data

RStudio IDE
• License: AGPL 3
• Scripts
• Workspace
• Console
• Images

Reading and writing data
• From/To plain text files
> heisenberg <- read.csv(file="simple.csv", header=TRUE, sep=",")
> write.csv(x=heisenberg, file="simple.csv")

• From/To Excel files
> library(gdata)
> mydata <- read.xls("mydata.xls")
> library(xlsx)    #write.xlsx comes from the xlsx package, not from gdata
> write.xlsx(x=mydata, file="simple.xlsx")

> library(XLConnect)
> wk <- loadWorkbook("mydata.xls")
> df <- readWorksheet(wk, sheet="Sheet1")

• From/To Databases
> library(RPostgreSQL)
> con <- dbConnect(dbDriver("PostgreSQL"), dbname="abc", user="postgres")
> q <- dbGetQuery(con, "SELECT * FROM prices WHERE x > 0")
> dbSendQuery(con, "INSERT INTO forecasts VALUES (10)")

• From the Web
> fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
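The plain-text case can be tried without any external file. The sketch below is only an illustration: the `prices` data frame and its columns are hypothetical, and a temporary file stands in for simple.csv.

```r
# Hypothetical data frame, used only for this illustration
prices <- data.frame(day = 1:3, price = c(10.5, 11.2, 9.8))

# Write to a temporary file so the working directory is left untouched
path <- tempfile(fileext = ".csv")
write.csv(prices, file = path, row.names = FALSE)

# Read it back; header = TRUE takes the column names from the first line
prices2 <- read.csv(file = path, header = TRUE, sep = ",")
print(prices2)
```

Passing row.names = FALSE keeps write.csv from adding an extra column of row labels, so the file reads back with the same two columns.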

Programming features
• Flow control statements
  • while, repeat, for, break, next, if, return
• Exceptions, using tryCatch blocks
• Functions
  • Default parameters
  • Positional or named arguments
  • Generic
  • Anonymous

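fibonacci <- function(n) {
  if (n <= 2) return(1)          #return is a function: parentheses are required
  fib <- numeric(n)
  fib[1:2] <- 1
  for (i in 3:n) {
    fib[i] <- fib[i-1] + fib[i-2]
  }
  return(fib[n])
}

arr <- function(a = 1, b = 2) {  #default parameters
  c(a, b)
}
> arr(b=6)                       #named argument
[1] 1 6

f3 <- function(f) { f(3) }       #takes a function as argument
f3(function(x) { x*7 })          #anonymous function

`%my%` <- function(a, b) {       #user-defined infix operator
  return(2*a + 2*b)
}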
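Exception handling and the custom infix operator are listed but not shown in use. A minimal sketch follows, assuming a hypothetical helper named `safe_log` (not part of base R); `%my%` is redefined here only so that the block runs on its own.

```r
# safe_log is a hypothetical helper written for this illustration
safe_log <- function(x) {
  tryCatch(
    {
      if (!is.numeric(x)) stop("x must be numeric")
      log(x)
    },
    error = function(e) {
      # Runs only when an error is signalled inside the block above
      message("Caught: ", conditionMessage(e))
      NA_real_
    },
    finally = message("done")    # evaluated whether or not an error occurred
  )
}

safe_log(1)      # 0
safe_log("a")    # NA, after the error message has been reported

# Same definition as the %my% operator, then called between its operands
`%my%` <- function(a, b) 2 * a + 2 * b
3 %my% 4         # 14
```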

Correlation
                     mpg  hp cyl
Mazda RX4           21.0 110   6
Mazda RX4 Wag       21.0 110   6
Datsun 710          22.8  93   4
Hornet 4 Drive      21.4 110   6
Hornet Sportabout   18.7 175   8
Valiant             18.1 105   6
Duster 360          14.3 245   8
Merc 240D           24.4  62   4
Merc 230            22.8  95   4
Merc 280            19.2 123   6
Merc 280C           17.8 123   6
Merc 450SE          16.4 180   8
Merc 450SL          17.3 180   8
Merc 450SLC         15.2 180   8
Cadillac Fleetwood  10.4 205   8
Lincoln Continental 10.4 215   8
Chrysler Imperial   14.7 230   8
Fiat 128            32.4  66   4
Honda Civic         30.4  52   4
Toyota Corolla      33.9  65   4
Toyota Corona       21.5  97   4
Dodge Challenger    15.5 150   8
AMC Javelin         15.2 150   8
Camaro Z28          13.3 245   8
Pontiac Firebird    19.2 175   8
Fiat X1-9           27.3  66   4
Porsche 914-2       26.0  91   4
Lotus Europa        30.4 113   4
Ford Pantera L      15.8 264   8
Ferrari Dino        19.7 175   6
Maserati Bora       15.0 335   8
Volvo 142E          21.4 109   4

> mtcars2 <- subset(mtcars, select=c("mpg", "hp", "cyl"))

> pairs(mtcars2)
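The scatterplot matrix shows the relationships visually; the same information can be summarised numerically with cor. A short sketch on the same three columns (Pearson correlation is the default method):

```r
# mtcars ships with R in the datasets package
mtcars2 <- subset(mtcars, select = c("mpg", "hp", "cyl"))

# Pairwise Pearson correlation coefficients
cm <- cor(mtcars2)
print(round(cm, 2))
```

In this matrix mpg is negatively correlated with both hp and cyl, while hp and cyl are positively correlated with each other.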

Auto-correlation

> A <- read.table("http://cdiac.ornl.gov/ftp/trends/co2/maunaloa.co2")
> X=t(A[1,])
> ts.plot(X)
> acf(X)
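If the remote file is not reachable, a similar analysis can be sketched with the `co2` series from R's datasets package (monthly Mauna Loa CO2 concentrations). This is a swapped-in data set, not the file read above, so the values differ from the slide.

```r
# co2 is a monthly time series (frequency 12) shipped with R
r <- acf(co2, lag.max = 24, plot = FALSE)   # plot = FALSE returns the values only

# Autocorrelation at lag 0 is always 1
r$acf[1]

# One value per lag, from 0 to lag.max
length(r$acf)
```

Calling acf(co2) without plot = FALSE draws the correlogram instead, as on the slide.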

Plotting
> x <- seq(-1.57,1.57,by=.001)

> y <- (sqrt(abs(cos(x))) * cos(200*x) +sqrt(abs(x))-0.7) * (4-x * x)^0.01

> plot(0,0, type='n',xlim=c(-2,+2),ylim=c(-1.6,+1.1))

> lines(x,y,col='pink')

> spread <- seq(1, length(x),length.out=length(x)/10)

> cols <- c('yellow','red','orange', 'purple')

> text(x[spread],y[spread], label='love', col=sample(rep(cols, length.out=length(spread))), cex=1)

Regression
> library("MASS")
> str(cats)
'data.frame':   144 obs. of  3 variables:
 $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
 $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
> attach(cats)
> lm.out <- lm(Hwt ~ Bwt)
> lm.out

Call:
lm(formula = Hwt ~ Bwt)

Coefficients:
(Intercept)          Bwt
    -0.3567       4.0341

> plot(Hwt ~ Bwt, main="Kitty Cat Plot")
> abline(lm.out, col="red")
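Once the model is fitted it can be used for prediction. The sketch below refits the same model with the data argument instead of attach; the body weights in `new_cats` are hypothetical values chosen only for illustration.

```r
library(MASS)                            # provides the cats data set

# Same model as above; data = cats avoids attaching the data frame
fit <- lm(Hwt ~ Bwt, data = cats)
coef(fit)                                # intercept and slope

# Hypothetical body weights for which heart weight is predicted
new_cats <- data.frame(Bwt = c(2.5, 3.0))
pred <- predict(fit, newdata = new_cats)
print(pred)
```

The column name in newdata has to match the predictor name used in the formula (Bwt here).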

Data manipulation: discretisation
> clinical.trial <- data.frame(patient = 1:100,
+     age = rnorm(100, mean = 60, sd = 8),
+     year.enroll = sample(paste("19", 85:99, sep = ""), 100, replace = TRUE))
> c1 <- cut(clinical.trial$age, breaks = 4)
> table(c1)
  (41.1,50]   (50,58.8] (58.8,67.6] (67.6,76.4]
          9          34          41          16
> hist(clinical.trial$age, breaks=seq(40,100, by=10))
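With breaks = 4 the interval limits depend on the random sample. When fixed classes are needed, cut also accepts explicit break points and labels. A small sketch with hypothetical ages and labels:

```r
# Hypothetical ages, used only for this illustration
ages <- c(45, 52, 60, 61, 73, 58)

# right = FALSE gives intervals [40,50), [50,60), [60,70), [70,80)
groups <- cut(ages,
              breaks = c(40, 50, 60, 70, 80),
              labels = c("40s", "50s", "60s", "70s"),
              right = FALSE)

table(groups)    # 40s: 1, 50s: 2, 60s: 2, 70s: 1
```

With the default right = TRUE an age of exactly 60 would fall in the (50,60] class instead.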

Plots from my MSc thesis
• Prices of energy in the Italian Power Exchange spot market
• Forecast using a SARIMA model

Performance
• Good performance with built-in math functions
• Possibility to monitor the memory usage
• Possibility to offload data to an external DB to speed up large operations
• Functions for big data sets
• Parallel computation
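The first two points can be checked directly from the console. The sketch below compares an explicit loop with the built-in sum and reports the memory used by a vector; `loop_sum` is a hypothetical helper, and timings depend on the machine, so none are quoted here.

```r
n <- 1e5
x <- runif(n)

# Hypothetical helper: sums a vector with an explicit R-level loop
loop_sum <- function(v) {
  s <- 0
  for (e in v) s <- s + e
  s
}

system.time(loop_sum(x))   # interpreted loop
system.time(sum(x))        # built-in, implemented in compiled code

# Memory taken by the vector (8 bytes per double plus a small header)
print(object.size(x), units = "Kb")
```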

Credits
• http://adv-r.had.co.nz/
• http://cran.r-project.org/
• http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/
• https://kaosktrl.wordpress.com/2010/02/04/r-lanalisi-delle-serie-storiche-partendo-da-copenaghen/


Vector part 1
> x <- c(2,5,9.5,-3)      #create a vector
> x[2]                    #select the second element
[1] 5
> x[c(2,4)]               #select the elements in positions 2 and 4
[1]  5 -3
> x[-c(1,3)]              #drop the elements in positions 1 and 3
[1]  5 -3
> x[x>0]                  #select only the positive elements
[1] 2.0 5.0 9.5
> x[!(x<=0)]              #drop the non-positive elements
[1] 2.0 5.0 9.5
> x[x>0]-1
[1] 1.0 4.0 8.5
> x[x>0]+c(1,2,3)         #element-wise sum
[1]  3.0  7.0 12.5
> x[x>0][2]
[1] 5

Vector part 2
> which(x>0)              #indexes of the elements that match the condition
[1] 1 2 3
> which.max(x)
[1] 3
> which.min(x)
[1] 4
> length(x)
[1] 4
> x <- 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> paste(1:5, c("A","B"), sep="")
[1] "1A" "2B" "3A" "4B" "5A"
> x1 <- seq(1,1000, length=10)   #10 equally spaced values from 1 to 1000
> x1
 [1]    1  112  223  334  445  556  667  778  889 1000
> x2 <- rep(2,times=10)          #repeat 2 ten times
> x2
 [1] 2 2 2 2 2 2 2 2 2 2
> rep(c(1,3),times=4)            #repeat (1,3) 4 times
[1] 1 3 1 3 1 3 1 3
> rep(c(1,9),c(3,1))             #repeat 1 three times and 9 once
[1] 1 1 1 9
> length(c(x,x1,x2,3))
[1] 31
#see also sort, order, eigen
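The sort and order functions mentioned at the end differ in what they return: sort gives the sorted values, order gives the indexes that would sort the vector. A short sketch on the vector from part 1:

```r
x <- c(2, 5, 9.5, -3)

sort(x)                          # -3.0  2.0  5.0  9.5
order(x)                         # 4 1 2 3

# Indexing with order() reproduces sort(); decreasing = TRUE reverses it
x[order(x)]                      # -3.0  2.0  5.0  9.5
x[order(x, decreasing = TRUE)]   # 9.5  5.0  2.0 -3.0
```

order is the more general tool because the same index vector can be used to reorder other vectors or the rows of a data frame.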

Matrix part 1
> x <- matrix(1:10,ncol=5)    #create a 2x5 matrix, filled by column
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x[,1]                       #select the first column
[1] 1 2
> x[2,]                       #select the second row
[1]  2  4  6  8 10
> x[,4:5]                     #select columns 4 and 5
     [,1] [,2]
[1,]    7    9
[2,]    8   10
> x[,-c(2,4)]                 #select columns 1, 3 and 5
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
> cbind(1:2,c(1,-2),c(0,9))   #combine vectors by column (rbind combines by row)
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2   -2    9
> x[2,] <- rep(2,5)           #overwrite the second row
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    2    2    2    2

Matrix part 2
> X <- diag(1:3)
> X
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
> solve(X)                    #the inverse of X
     [,1] [,2]      [,3]
[1,]    1  0.0 0.0000000
[2,]    0  0.5 0.0000000
[3,]    0  0.0 0.3333333
> X%*%solve(X)                #....verify
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
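Besides inverting a matrix, solve can also solve a linear system A x = b directly when given a right-hand side. A small sketch with a hypothetical 2x2 system:

```r
# Hypothetical system:  2x +  y = 3
#                        x + 3y = 5
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # filled by column
b <- c(3, 5)

sol <- solve(A, b)     # solves A %*% sol = b without forming the inverse
sol                    # 0.8 1.4

A %*% sol              # reproduces b
```

For a single system this is usually preferable to computing solve(A) %*% b.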

List, can contain different object types
> lista <- list(matrix(1:9,nrow=3),rep(0,3),c('good','bad'))
> length(lista)
[1] 3
> lista[[3]]                  #third element
[1] "good" "bad"
> length(lista[[3]])
[1] 2
> lista[[2]]+2                #sum on the second item
[1] 2 2 2
> lista[[1]][2,2]
[1] 5
> names(lista) <- c('first','second','third')   #names for elements
> lista$second                #or lista[["second"]], returns a vector
[1] 0 0 0
> lista["second"]             #returns a list containing only the selected element
$second
[1] 0 0 0
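Because the elements have different types, a common way to work with a list is to apply a function to each element. A brief sketch on the same list, built here with its names already set:

```r
lista <- list(first  = matrix(1:9, nrow = 3),
              second = rep(0, 3),
              third  = c("good", "bad"))

# sapply applies a function to every element and simplifies the result
sapply(lista, length)       # first: 9, second: 3, third: 2
sapply(lista, is.numeric)   # first: TRUE, second: TRUE, third: FALSE

# str gives a compact overview of the whole structure
str(lista)
```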

Multidimensional Array and named indexes
> a <- array(1:24, dim=c(3,4,2))
> dim(a)                      #show dimensions
[1] 3 4 2
> a[,,2]
     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
> a[1,,]
     [,1] [,2]
[1,]    1   13
[2,]    4   16
[3,]    7   19
[4,]   10   22
> a[1,2,1]
[1] 4

> x <- matrix(1:10, ncol=5)
> dimnames(x) <- list(c("X","Y"),NULL)
> x
  [,1] [,2] [,3] [,4] [,5]
X    1    3    5    7    9
Y    2    4    6    8   10
> dimnames(x)[[2]] <- c("g","h","j","j","k")
> x
  g h j j  k
X 1 3 5 7  9
Y 2 4 6 8 10
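Once names are set they can be used in place of numeric positions. The sketch below assumes unique column names ("i" replaces the first "j"), which is a change from the example above made so that every name selects exactly one column.

```r
# Column names are hypothetical and unique, unlike the example above
x <- matrix(1:10, ncol = 5,
            dimnames = list(c("X", "Y"), c("g", "h", "i", "j", "k")))

x["X", "h"]             # 3
x["Y", c("g", "k")]     # g = 2, k = 10

# drop = FALSE keeps the result as a one-column matrix
x[, "j", drop = FALSE]
```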

Summary of Data Structures

               Linear    Rectangular
Homogeneous    Vectors   Matrices
Heterogeneous  Lists     Data frames

Data frame
> X <- data.frame(id=1:4, sex=c("M","F","F","M"), stringsAsFactors=TRUE)   #sex as a factor (no longer the default since R 4.0)
> X
  id sex
1  1   M
2  2   F
3  3   F
4  4   M
> X$age <- c(2.5,3,5,6.2)
> X
  id sex age
1  1   M 2.5
2  2   F 3.0
3  3   F 5.0
4  4   M 6.2
> subset(X, subset=(age<3 | age>5), select=-age)   #same as X[X$age<3 | X$age>5, c("id","sex")]
  id sex
1  1   M
4  4   M
#see also merge, attach

> summary(X)
       id       sex        age
 Min.   :1.00   F:2   Min.   :2.500
 1st Qu.:1.75   M:2   1st Qu.:2.875
 Median :2.50         Median :4.000
 Mean   :2.50         Mean   :4.175
 3rd Qu.:3.25         3rd Qu.:5.300
 Max.   :4.00         Max.   :6.200
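The merge function mentioned above joins two data frames on a common column. A minimal sketch, where the second data frame `W` and its weight values are hypothetical:

```r
X <- data.frame(id = 1:4, sex = c("M", "F", "F", "M"),
                age = c(2.5, 3, 5, 6.2))

# Hypothetical second table: no row for id 3
W <- data.frame(id = c(1L, 2L, 4L), weight = c(12.1, 13.4, 20.0))

# Default is an inner join: only ids present in both tables
m <- merge(X, W, by = "id")
print(m)

# all.x = TRUE keeps every row of X; the missing weight becomes NA
m_all <- merge(X, W, by = "id", all.x = TRUE)
print(m_all)
```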
