R programming language

Alberto Minetti


Post on 14-Apr-2017


Category:

Engineering



TRANSCRIPT

Page 1: R programming language

Alberto Minetti

Page 2: R programming language

What is R?
• Functional programming language
• Matrix-based
• Interpreted (the interpreter is written in C and Fortran)
• Environment for statistical computing and graphics
• Open source, released under the GPL license

• 6000+ packages in CRAN

Page 3: R programming language

Why use R?
• Matrix calculation
• Data visualization (interactive too)
• Statistical analysis (regression, time series, geo-spatial)
• Data mining, classification, clustering
• Analysis of genomic data
• Machine learning

Page 4: R programming language

Who uses R?
• Oracle integrates R in its Big Data Appliance
• IBM offers support for in-Hadoop execution of R
• Data analysts at Google and Apple
• 12th in the TIOBE popularity index

Page 5: R programming language

How to use R?
• Command-line interface, standalone scripts or graphical front-ends
• Connection to any data source
• Data analysis
• Modeling and computation
• Data visualization
• Fitting models or displaying data

Page 6: R programming language

RStudio IDE
• License: AGPL 3
• Scripts
• Workspace
• Console
• Images

Page 7: R programming language

Reading and writing data
• From/To plain text files

• From/To Excel files

• From/To Databases

• From the Web

> heisenberg <- read.csv(file="simple.csv", header=TRUE, sep=",")
> write.csv(x=heisenberg, file="simple.csv")

> library(gdata)
> mydata = read.xls("mydata.xls")
> library(xlsx) #write.xlsx comes from the xlsx package, not from gdata
> write.xlsx(x=mydata, file="simple.xlsx")

> library(XLConnect)
> wk = loadWorkbook("mydata.xls")
> df = readWorksheet(wk, sheet="Sheet1")

> library(RPostgreSQL)
> con <- dbConnect(dbDriver("PostgreSQL"), dbname = "abc", user="postgres")
> q <- dbGetQuery(con, "SELECT * FROM prices WHERE x > 0")
> dbSendQuery(con, "INSERT INTO forecasts VALUES (10)")

> fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
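The plain-text round trip can be tried without any existing file. The following is a minimal sketch, assuming a temporary file and a made-up data frame; `prices` and its columns are hypothetical and do not come from the slides.

```r
# Hypothetical data, used only to illustrate the round trip.
prices <- data.frame(day = 1:3, price = c(10.5, 11.2, 9.8))

# Write to a temporary CSV file, without the row-name column.
path <- tempfile(fileext = ".csv")
write.csv(prices, file = path, row.names = FALSE)

# Read it back; the first line is treated as the header.
back <- read.csv(path, header = TRUE)
str(back)

# Clean up the temporary file.
unlink(path)
```

Passing `row.names = FALSE` avoids an extra unnamed column when the file is read back.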

Page 8: R programming language

Programming features
• Flow control statements
  • while, repeat, break, next, if, return
• Exceptions, using try and tryCatch
• Functions
  • Default parameters
  • Positional or named arguments
  • Generic
  • Anonymous

fibonacci <- function(n) {
  if (n <= 2) return(1) #return() needs parentheses in R
  fib <- numeric(n)
  fib[1:2] <- 1
  for (i in 3:n) {
    fib[i] <- fib[i-1] + fib[i-2]
  }
  return(fib[n])
}
arr <- function(a = 1, b = 2) {
  c(a, b)
}
> arr(b=6)
[1] 1 6

f3 <- function(f) { f(3) }
f3(function(x) {x*7})

`%my%` <- function(a,b) { return(2*a + 2*b) }
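The slide lists exceptions and generic functions without showing them. A minimal sketch of both follows, using `tryCatch` and S3 dispatch via `UseMethod`; the names `safe_log` and `describe` are hypothetical helpers made up for this illustration.

```r
# Hypothetical helper: returns NA instead of failing or warning.
safe_log <- function(x) {
  tryCatch(
    {
      if (!is.numeric(x)) stop("x must be numeric")
      log(x)
    },
    warning = function(w) NA_real_,  # log(-1) signals a warning
    error   = function(e) NA_real_   # stop() signals an error
  )
}
safe_log(1)    # 0
safe_log(-1)   # NA, the warning was caught
safe_log("a")  # NA, the error was caught

# Hypothetical S3 generic: the method is chosen from the class of obj.
describe <- function(obj, ...) UseMethod("describe")
describe.default <- function(obj, ...) "something else"
describe.numeric <- function(obj, ...) {
  paste("numeric vector of length", length(obj))
}
describe(c(1, 2, 3))  # "numeric vector of length 3"
describe("a")         # "something else"
```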

Page 9: R programming language

Correlation
                     mpg  hp cyl
Mazda RX4           21.0 110   6
Mazda RX4 Wag       21.0 110   6
Datsun 710          22.8  93   4
Hornet 4 Drive      21.4 110   6
Hornet Sportabout   18.7 175   8
Valiant             18.1 105   6
Duster 360          14.3 245   8
Merc 240D           24.4  62   4
Merc 230            22.8  95   4
Merc 280            19.2 123   6
Merc 280C           17.8 123   6
Merc 450SE          16.4 180   8
Merc 450SL          17.3 180   8
Merc 450SLC         15.2 180   8
Cadillac Fleetwood  10.4 205   8
Lincoln Continental 10.4 215   8
Chrysler Imperial   14.7 230   8
Fiat 128            32.4  66   4
Honda Civic         30.4  52   4
Toyota Corolla      33.9  65   4
Toyota Corona       21.5  97   4
Dodge Challenger    15.5 150   8
AMC Javelin         15.2 150   8
Camaro Z28          13.3 245   8
Pontiac Firebird    19.2 175   8
Fiat X1-9           27.3  66   4
Porsche 914-2       26.0  91   4
Lotus Europa        30.4 113   4
Ford Pantera L      15.8 264   8
Ferrari Dino        19.7 175   6
Maserati Bora       15.0 335   8
Volvo 142E          21.4 109   4

> mtcars2 <- subset(mtcars, select=c("mpg", "hp", "cyl"))

> pairs(mtcars2)
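`pairs` only draws the scatter plots. To get the correlation coefficients as numbers, one option is `cor` on the same subset; this is a sketch that goes one step beyond the slide, using only the built-in `mtcars` data.

```r
# Same three columns as on the slide.
mtcars2 <- subset(mtcars, select = c("mpg", "hp", "cyl"))

# Pearson correlation matrix: symmetric, with ones on the diagonal.
cm <- cor(mtcars2)
round(cm, 2)

# A single coefficient can be picked by name.
cm["mpg", "hp"]
```

Fuel economy (`mpg`) is negatively correlated with both horsepower and cylinder count, which matches the downward trends visible in the scatter plots.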

Page 10: R programming language

Auto-correlation

> A <- read.table("http://cdiac.ornl.gov/ftp/trends/co2/maunaloa.co2")
> X=t(A[1,])
> ts.plot(X)
> acf(X)
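The same kind of analysis can be sketched offline. The example below assumes the built-in `co2` dataset (monthly Mauna Loa CO2 concentrations shipped with R) as a stand-in for the downloaded file; it is not the exact series used on the slide.

```r
# Built-in monthly time series, used here instead of the remote file.
x <- co2

# Autocorrelation up to 24 lags, computed without drawing the plot.
r <- acf(x, lag.max = 24, plot = FALSE)

# The first value is lag 0, which is always 1.
r$acf[1]

# Calling plot() on the result draws the usual correlogram.
plot(r)
```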

Page 11: R programming language

Plotting
> x <- seq(-1.57,1.57,by=.001)

> y <- (sqrt(abs(cos(x))) * cos(200*x) +sqrt(abs(x))-0.7) * (4-x * x)^0.01

> plot(0,0, type='n',xlim=c(-2,+2),ylim=c(-1.6,+1.1))

> lines(x,y,col='pink')

> spread <- seq(1, length(x),length.out=length(x)/10)

> cols <- c('yellow','red','orange', 'purple')

> text(x[spread],y[spread], label='love', col=sample(rep(cols, length.out=length(spread))), cex=1)

Page 12: R programming language

Regression
> library("MASS")
> str(cats)
'data.frame': 144 obs. of 3 variables:
 $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
 $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
> attach(cats)
> lm.out <- lm(Hwt ~ Bwt)
> lm.out
Call:
lm(formula = Hwt ~ Bwt)
Coefficients:
(Intercept)          Bwt
    -0.3567       4.0341
> plot(Hwt ~ Bwt, main="Kitty Cat Plot")
> abline(lm.out, col="red")
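Once the model is fitted it can be inspected and used for prediction. A minimal sketch, assuming the same `cats` data from MASS but passing it through the `data` argument rather than `attach`; the body weight of 3 kg is an arbitrary value chosen for illustration.

```r
library(MASS)

# Same model as on the slide, without attaching the data frame.
fit <- lm(Hwt ~ Bwt, data = cats)

# Intercept and slope.
coef(fit)

# Share of the variance in heart weight explained by body weight.
r2 <- summary(fit)$r.squared

# Predicted heart weight (g) for a hypothetical 3 kg cat.
pred <- predict(fit, newdata = data.frame(Bwt = 3))
pred
```

Using `data = cats` keeps the variables out of the search path, which avoids name clashes that `attach` can cause.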

Page 13: R programming language

Data manipulation: discretisation
> clinical.trial <- data.frame(patient = 1:100, age = rnorm(100, mean = 60, sd = 8), year.enroll = sample(paste("19", 85:99, sep = ""), 100, replace = TRUE))
> c1 <- cut(clinical.trial$age, breaks = 4)
> table(c1)
  (41.1,50]   (50,58.8] (58.8,67.6] (67.6,76.4]
          9          34          41          16
> hist(clinical.trial$age, breaks=seq(40,100, by=10))
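With `breaks = 4` the cut points depend on the random sample. When fixed groups are wanted, `cut` also accepts explicit break points and labels. A sketch under assumed break points (50, 60, 70) and made-up labels, neither of which appears on the slide:

```r
set.seed(1)  # fixed seed so the sample is reproducible
age <- rnorm(100, mean = 60, sd = 8)

# Intervals are right-closed: (-Inf,50], (50,60], (60,70], (70,Inf].
groups <- cut(age,
              breaks = c(-Inf, 50, 60, 70, Inf),
              labels = c("up to 50", "50-60", "60-70", "over 70"))

# Number of patients per age group.
counts <- table(groups)
counts
```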

Page 14: R programming language

Plots from my MSc thesis
• Prices of energy in the Italian Power Exchange spot market
• Forecast using a SARIMA model
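The thesis data and the model orders are not given in the slides. As a rough illustration of what fitting and forecasting with a SARIMA model looks like in base R, the sketch below assumes the built-in `AirPassengers` series and the classic (0,1,1)(0,1,1) seasonal specification with period 12; both are stand-ins, not the author's setup.

```r
# Log-transform stabilises the growing seasonal swings.
y <- log(AirPassengers)

# Seasonal ARIMA(0,1,1)(0,1,1)[12], fitted with stats::arima.
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the next 12 months and map back to the original scale.
fc <- predict(fit, n.ahead = 12)
forecast_values <- exp(fc$pred)
forecast_values
```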

Page 15: R programming language

Performance
• Good performance with built-in math functions
• Memory usage can be monitored
• Data can be offloaded to an external DB to speed up large operations
• Functions for big data sets
• Parallel computation
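A few of these points can be tried directly in base R. The sketch below compares an explicit loop with a vectorised built-in, reports the memory used by an object, and calls the `parallel` package that ships with R. The function names `loop_sum` and `vec_sum` are hypothetical, and `mc.cores = 1` is an assumption that keeps the example runnable on every platform.

```r
set.seed(42)
v <- runif(1e5)

# Explicit loop: sums the squares one element at a time.
loop_sum <- function(v) {
  s <- 0
  for (e in v) s <- s + e^2
  s
}
# Vectorised version: relies on the built-in sum() and ^.
vec_sum <- function(v) sum(v^2)

system.time(loop_sum(v))
system.time(vec_sum(v))
same <- isTRUE(all.equal(loop_sum(v), vec_sum(v)))

# Memory used by one object, then a garbage-collection pass.
print(object.size(v), units = "Kb")
invisible(gc())

# mc.cores = 1 runs serially; on Unix-alikes it can be raised.
squares <- unlist(parallel::mclapply(1:4, function(i) i^2, mc.cores = 1))
```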

Page 17: R programming language

Vector part 1
> x <- c(2,5,9.5,-3) #create a vector
> x[2] #select the second element
[1] 5
> x[c(2,4)] #select the elements in positions 2 and 4
[1]  5 -3
> x[-c(1,3)] #drop the elements in positions 1 and 3
[1]  5 -3
> x[x>0] #select only the positive elements
[1] 2.0 5.0 9.5
> x[!(x<=0)] #drop the non-positive elements
[1] 2.0 5.0 9.5

> x[x>0]-1
[1] 1.0 4.0 8.5
> x[x>0]+c(1,2,3) #sum element-wise
[1]  3.0  7.0 12.5
> x[x>0][2]
[1] 5
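Element-wise arithmetic also works when the two vectors differ in length, because the shorter one is recycled, and a logical index can be used on the left side of an assignment. A short sketch on the same `x`:

```r
x <- c(2, 5, 9.5, -3)

# Element-wise sum of the three positive values and 1, 2, 3.
shifted <- x[x > 0] + c(1, 2, 3)   # 3.0 7.0 12.5

# c(10, 20) is recycled to 10, 20, 10, 20.
recycled <- x + c(10, 20)          # 12.0 25.0 19.5 17.0

# Replace the negative elements of a copy with 0.
y <- x
y[y < 0] <- 0                      # 2.0 5.0 9.5 0.0
```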

Page 18: R programming language

Vector part 2
> which(x>0) #show the indexes that match the condition
[1] 1 2 3
> which.max(x) #index of the largest element
[1] 3
> which.min(x) #index of the smallest element
[1] 4
> length(x)
[1] 4

> x<-1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> paste(1:5, c("A","B"), sep="")
[1] "1A" "2B" "3A" "4B" "5A"
> x1<-seq(1,1000, length.out=10) #10 equally spaced values from 1 to 1000
> x1
 [1]    1  112  223  334  445  556  667  778  889 1000
> x2<-rep(2,times=10) #repeat 2 ten times
> x2
 [1] 2 2 2 2 2 2 2 2 2 2
> rep(c(1,3),times=4) #repeat (1,3) 4 times
[1] 1 3 1 3 1 3 1 3
> rep(c(1,9),c(3,1)) #repeat 1 three times and 9 once
[1] 1 1 1 9
> length(c(x,x1,x2,3))
[1] 31
#see also sort, order, eigen
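The slide points to `sort` and `order` without showing them. A brief sketch on the four-element vector from part 1: `sort` returns the sorted values, while `order` returns the positions that would sort the vector.

```r
x <- c(2, 5, 9.5, -3)

sorted <- sort(x)                          # -3.0 2.0 5.0 9.5
idx <- order(x)                            # 4 1 2 3
descending <- x[order(x, decreasing = TRUE)]  # 9.5 5.0 2.0 -3.0
```

`order` is the one to use when a second vector, or the rows of a data frame, must be rearranged in the same way.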

Page 19: R programming language

Matrix part 1
> x<-matrix(1:10,ncol=5) #create a 2x5 matrix, filled by column
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

> x[,1] #select the first column
[1] 1 2

> x[,4:5] #select columns 4 and 5
     [,1] [,2]
[1,]    7    9
[2,]    8   10

> x[2,] #select the second row
[1]  2  4  6  8 10

> x[,-c(2,4)] #select columns 1, 3 and 5
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10

> cbind(1:2,c(1,-2),c(0,9)) #combine vectors by columns (rbind combines by rows)
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2   -2    9

> x[2,]<-rep(2,5) #overwrite the second row
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    2    2    2    2
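Beyond selecting rows and columns, whole-matrix summaries are common. A small sketch on the original 2x5 matrix (before the second row is overwritten), using only base functions:

```r
x <- matrix(1:10, ncol = 5)

dims <- dim(x)                # 2 5
tx <- t(x)                    # transpose: 5 rows, 2 columns
row_totals <- rowSums(x)      # 25 30
col_means <- colMeans(x)      # 1.5 3.5 5.5 7.5 9.5
row_max <- apply(x, 1, max)   # 9 10, max applied to each row
```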

Page 20: R programming language

Matrix part 2
> X<-diag(1:3)
> X
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3

> solve(X) #the inverse of X
     [,1] [,2]      [,3]
[1,]    1  0.0 0.0000000
[2,]    0  0.5 0.0000000
[3,]    0  0.0 0.3333333

> X%*%solve(X) #....verify
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
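`solve` can also solve a linear system directly when given a right-hand side, which avoids forming the inverse. A sketch with a made-up 2x2 system (the matrix `A` and vector `b` are illustrative values, not from the slides):

```r
# System: 2x + y = 3 and x + 3y = 5.
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(3, 5)

# Solves A %*% sol = b.
sol <- solve(A, b)        # 0.8 1.4

# Verify: the product reproduces b.
check <- drop(A %*% sol)  # 3 5
```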

Page 21: R programming language

List, can contain different object types
> lista<-list(matrix(1:9,nrow=3),rep(0,3),c('good','bad'))
> length(lista)
[1] 3
> lista[[3]] #third element
[1] "good" "bad"
> length(lista[[3]])
[1] 2
> lista[[2]]+2 #add 2 to each value of the second element
[1] 2 2 2
> lista[[1]][2,2]
[1] 5
> names(lista)<-c('first', 'second', 'third') #names for elements
> lista$second #or lista[["second"]], returns a vector
[1] 0 0 0
> lista["second"] #returns a list containing only the selected element
$second
[1] 0 0 0
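The difference between `[[ ]]` and `[ ]` is easy to check with `class`, and `sapply` applies a function to every element. A short sketch that rebuilds the same list with names given up front:

```r
lista <- list(first = matrix(1:9, nrow = 3),
              second = rep(0, 3),
              third = c("good", "bad"))

# Length of each element: 9, 3 and 2.
sizes <- sapply(lista, length)

# Double brackets extract the element; single brackets keep a list.
inner <- class(lista[["second"]])   # "numeric"
outer <- class(lista["second"])     # "list"

# Assigning to a new name appends an element.
lista$fourth <- TRUE
n <- length(lista)                  # 4
```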

Page 22: R programming language

Multidimensional Array and named indexes
> a<-array(1:24, dim=c(3,4,2))
> dim(a) #show dimensions
[1] 3 4 2
> a[,,2]
     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
> a[1,,]
     [,1] [,2]
[1,]    1   13
[2,]    4   16
[3,]    7   19
[4,]   10   22
> a[1,2,1]
[1] 4

> x<-matrix(1:10, ncol=5)
> dimnames(x)<-list(c("X","Y"),NULL)
> x
  [,1] [,2] [,3] [,4] [,5]
X    1    3    5    7    9
Y    2    4    6    8   10
> dimnames(x)[[2]]<-c("g","h","j","j","k")
> x
  g h j j  k
X 1 3 5 7  9
Y 2 4 6 8 10
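Once dimensions are named, the names can be used in place of numeric indexes. The sketch below assumes unique column names (`"g"` to `"k"`, with `"i"` substituted for the repeated `"j"`) so that each name selects exactly one column; that substitution is an assumption made for the illustration.

```r
x <- matrix(1:10, ncol = 5,
            dimnames = list(c("X", "Y"), c("g", "h", "i", "j", "k")))

one <- x["Y", "h"]            # 4
some <- x["X", c("g", "k")]   # g = 1, k = 9
column <- x[, "j"]            # X = 7, Y = 8
```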

Summary of Data Structures

               Linear    Rectangular
Homogeneous    Vectors   Matrices
Heterogeneous  Lists     Data frames
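The homogeneous/heterogeneous distinction shows up as soon as mixed values are combined. A minimal sketch: a vector coerces everything to one type, a list keeps each element as it is.

```r
# c() builds a vector, so all values are coerced to character.
v <- c(1, "a", TRUE)
v_class <- class(v)           # "character"

# list() keeps the original type of every element.
l <- list(1, "a", TRUE)
l_classes <- sapply(l, class) # "numeric" "character" "logical"
```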

Page 23: R programming language

Data frame
> X<-data.frame(id=1:4, sex=c("M","F","F","M"))
> X
  id sex
1  1   M
2  2   F
3  3   F
4  4   M
> X$age<-c(2.5,3,5,6.2)
> X
  id sex age
1  1   M 2.5
2  2   F 3.0
3  3   F 5.0
4  4   M 6.2
> #equivalent to X[X$age<3 | X$age>5, c("id","sex")]
> subset(X,subset=(age<3 | age>5), select=-age)
  id sex
1  1   M
4  4   M
#see also merge, attach

> summary(X)
       id       sex        age
 Min.   :1.00   F:2   Min.   :2.500
 1st Qu.:1.75   M:2   1st Qu.:2.875
 Median :2.50         Median :4.000
 Mean   :2.50         Mean   :4.175
 3rd Qu.:3.25         3rd Qu.:5.300
 Max.   :4.00         Max.   :6.200
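The slide mentions `merge` without an example. The sketch below joins `X` with a second, made-up table and computes a per-group mean with `aggregate`; the data frame `W` and its `weight` column are hypothetical.

```r
X <- data.frame(id = 1:4, sex = c("M", "F", "F", "M"),
                age = c(2.5, 3, 5, 6.2))

# Hypothetical second table; id 3 is deliberately missing.
W <- data.frame(id = c(1, 2, 4), weight = c(10.1, 12.3, 20.0))

# Inner join on id: only ids present in both tables are kept.
merged <- merge(X, W, by = "id")

# Mean age per sex: F = 4.00, M = 4.35.
by_sex <- aggregate(age ~ sex, data = X, FUN = mean)
by_sex
```

Passing `all.x = TRUE` to `merge` would keep id 3 as well, with `NA` in the `weight` column.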