Introduction to Statistical Computing with R

June 30, 2016

Joe Wildenberg MD/PhD

Eric Schmitt MD/PhD

Tessa Cook MD/PhD

Welcome to the SIIM 2016 Learning Lab: “Introduction to Statistical Computing with R.” The goal of this session is to familiarize everyone with the power of R, a free statistical software package.

For this session, you will need the R software package, which can be downloaded at http://www.r-project.org/.

Optional but recommended additions include:

RStudio (http://www.rstudio.com/products/rstudio/download/)

R Markdown: install.packages("rmarkdown")

Advanced session

Many additional features of R will be covered in the Advanced session. These include abilities such as interfacing with command-line functions, advanced scripting, more complex plotting, and DICOM analysis.

Getting help with R

Help for R functions is available in R itself. For example, to get help for the function that lists files in the current directory, type help(dir). Additionally, you can see an example of how to use a function by typing example(dir). A shortcut uses the question mark, such as ?dir.

For more complete discussions and examples of complex or graphical functions, the website R Seek (http://rseek.org) is a fantastic resource.

Basic mathematical operations

The simplest commands in R revolve around operations on numbers. Most commands are typed just as they would be written.


2+5

## [1] 7

2*5

## [1] 10

2/5

## [1] 0.4

More complex operations usually have their own function. For example, to take the square root:

sqrt(2)

## [1] 1.414214

To perform calculations on multiple numbers, we need to tell R that the individual numbers are part of a grouping called a vector.

c(1,2,3)

## [1] 1 2 3

We can use built-in functions to calculate the mean and standard deviation of a vector.

mean(c(1,2,3,4,5))

## [1] 3

sd(c(1,2,3,4,5))

## [1] 1.581139
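Arithmetic and most built-in functions also operate on a whole vector at once, which is how the result of sd() can be reproduced by hand. The short sketch below is illustrative only; the names x, n and manual_sd are invented here and are not part of the original session.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

# Arithmetic is applied element by element
x * 2          # 2 4 6 8 10
x - mean(x)    # deviations from the mean: -2 -1 0 1 2

# sd() uses the sample formula, dividing by n - 1 rather than n
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))
manual_sd      # same value as sd(x)
```

Dividing by n - 1 instead of n is why the result is 1.581139 rather than the population value of about 1.414.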

Variables - storing and referencing

In order to start performing complex operations, we need to create and store some data. Let’s use the c() function to store some numbers. Note that these two ways of storing the data to a variable are identical.

data = c(1,2,4,8,16)
data <- c(1,2,4,8,16)
data

## [1] 1 2 4 8 16

Elements of a variable in R are primarily referenced using square brackets. Indexing starts at 1, so the first value is at position 1.

data[1]

## [1] 1

It is also possible to get multiple values at once by concatenating the indices of interest within the brackets. For example, say we wanted the 1st and 4th data points:

data[c(1,4)]

## [1] 1 8

More complex ways of referencing data will be discussed below.
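As a brief preview, square brackets also accept ranges, negative indices, and logical tests. This is a small sketch using the same five numbers as above; nothing in it goes beyond base R.

```r
data <- c(1, 2, 4, 8, 16)

data[2:4]        # a range of positions: 2 4 8
data[-1]         # a negative index drops that position: 2 4 8 16
data[data > 3]   # a logical test keeps the matching values: 4 8 16
length(data)     # number of elements: 5
```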

Loading data into R

Most people using R will have external data to load prior to manipulation or statistical analysis. Loading data from a text file or comma-delimited file (csv) is built into R. The ability to import from other file types (Excel, SPSS, SAS, Stata, Matlab, etc.) requires an additional package. See below for instructions.

The first step in loading data is to navigate to the directory where the data is stored.

The data for this tutorial can be downloaded at: https://dl.dropboxusercontent.com/u/91489998/sampleData.csv

1. Find your current working directory

getwd()

## [1] "/Users/User/R Course/SIIM 2016/Intro"

2. Change to the directory where the data is stored, in our case, on the desktop.

setwd("/Users/username/Desktop/")

3. Load the data


mydata = read.csv("sampleData.csv")

Note that our csv file uses the default delimiter, a comma. Instructions for changing the default values for this, and all functions, are given in the help page. Additional built-in utilities for importing data can be found through help(read.table).

Packages needed for importing proprietary formats must be installed separately. Please see the Loading packages section for instructions on how to install external R packages.

4. Check to make certain the data imported as expected. This function provides a summary of the structure of the variable.

str(mydata)


## 'data.frame': 10992 obs. of 21 variables:
##  $ AccessionNumber     : Factor w/ 10992 levels "6700859","7170145",..: 10315 10545 10809 10250 10844 10636 10781 10637 10243 10463 ...
##  $ StudyDate           : int 20131106 20121111 20121119 20121105 20121119 20121113 20121117 20121113 20121104 20121109 ...
##  $ RequestingLastName  : Factor w/ 1407 levels "Aaron","Abramson",..: 220 243 827 270 1325 1390 266 1390 1194 1026 ...
##  $ RequestingFirstName : Factor w/ 786 levels "Aaron","Abbie",..: 659 365 214 519 469 774 254 774 558 372 ...
##  $ ResponsibleLastName : Factor w/ 103 levels "Alvarado","Amon",..: 14 77 13 58 13 89 100 41 13 82 ...
##  $ ResponsibleFirstName: Factor w/ 89 levels "Alan","Alfredo",..: 15 89 53 28 53 32 64 22 53 48 ...
##  $ PatientID           : Factor w/ 909 levels "1.00E+15","1.01E+15",..: 886 404 796 115 886 411 441 411 374 600 ...
##  $ PatientsBirthDate   : Factor w/ 7830 levels "1/1/1900","1/1/26",..: 1277 7574 2006 5374 1636 1 482 1 7360 212 ...
##  $ PatientAge          : int 57 78 67 90 28 130 57 130 53 39 ...
##  $ PatientWeight       : num 0.001 0.001 0.001 69.4 92.986 ...
##  $ PatientSex          : Factor w/ 3 levels "F","M","O": 2 1 1 2 1 1 2 1 1 2 ...
##  $ PatientLocationCode : Factor w/ 489 levels "APUO ","CCU ",..: 228 243 344 352 269 453 243 453 243 270 ...
##  $ StudyTime           : num 110719 162525 61060 103744 200005 ...
##  $ CodeValue           : Factor w/ 183 levels "CAAO","CACS",..: 69 173 176 109 176 176 111 111 133 112 ...
##  $ CodeMeaning         : Factor w/ 271 levels "20982-RF BONE TUMOR ABLATION-CT GUIDED",..: 49 163 161 82 161 161 86 86 192 81 ...
##  $ MaxCTDI             : num 22.8 56.3 60.6 22.8 58.4 ...
##  $ TotalmAs            : int 4877 3452 2338 16156 2695 10568 4670 10568 694 7879 ...
##  $ ReportedTotalDLP    : num 1154 937 1143 1171 999 ...
##  $ Subspecialty        : Factor w/ 7 levels "BODY CT","CARDIO",..: 1 7 7 2 7 7 1 1 7 1 ...
##  $ BodySite            : Factor w/ 13 levels "ABD-PELVIS","ABDOMEN",..: 1 7 7 1 7 7 1 1 10 1 ...
##  $ EstimatedDose       : num 17.31 1.97 2.4 17.57 2.1 ...
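One caveat for readers on a newer installation: starting with R 4.0.0, read.csv() keeps text columns as character strings by default instead of converting them to factors, so str() output may differ from the listing above unless stringsAsFactors = TRUE is passed. The sketch below demonstrates both behaviours on a tiny made-up file written to a temporary location. The file contents and the names toy_file, as_text and as_factor are invented for illustration and are not part of sampleData.csv.

```r
# Write a small made-up CSV to a temporary file
toy_file <- tempfile(fileext = ".csv")
writeLines(c("PatientSex,PatientAge",
             "M,57",
             "F,78",
             "F,67"), toy_file)

# Being explicit means the result does not depend on the R version
as_text   <- read.csv(toy_file, stringsAsFactors = FALSE)
as_factor <- read.csv(toy_file, stringsAsFactors = TRUE)

class(as_text$PatientSex)     # "character"
class(as_factor$PatientSex)   # "factor"
str(as_factor)

unlink(toy_file)  # remove the temporary file
```

If the factor-based examples later in this tutorial behave unexpectedly, reloading the data with stringsAsFactors = TRUE is a reasonable first thing to try.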

Datatypes

In our case, the variable is a dataframe. A dataframe is a collection of other objects with minimal restrictions. Basic data types in R include:

int
num
logical
vectors
matrix/array
dataframes/lists
factors
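The class() function reports which of these types an object has. The following sketch uses only throwaway values created on the spot, so it can be run without the tutorial data.

```r
class(5L)                       # "integer"  (shown as int by str())
class(2.5)                      # "numeric"  (shown as num by str())
class(TRUE)                     # "logical"
class(matrix(1:6, nrow = 2))    # "matrix" "array" in R 4.0 and later
class(list(1, "a"))             # "list"
class(data.frame(x = 1:3))      # "data.frame"
class(factor(c("F", "M")))      # "factor"
is.vector(c(1, 2, 3))           # TRUE
```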

Data Frames

Dataframes allow multiple different types of data to be stored in a meaningful way. Referencing this data can be done in different ways:

age1 = mydata[[9]]
age2 = mydata[["PatientAge"]]
age3 = mydata$PatientAge
# all.equal() compares two objects at a time, so test the pairs separately
isTRUE(all.equal(age1,age2)) && isTRUE(all.equal(age2,age3))

## [1] TRUE

You can see that the first two examples use double brackets to reference the data. Double brackets extract the data from the dataframe, while single brackets return a new dataframe containing only the selected data.
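The difference between the two bracket styles is easiest to see by checking the class of each result. This sketch uses a three-row made-up data frame called toy (a hypothetical stand-in, not the tutorial data).

```r
toy <- data.frame(PatientAge = c(57, 78, 67),
                  PatientSex = c("M", "F", "F"))

class(toy[["PatientAge"]])   # "numeric": the column's values
class(toy["PatientAge"])     # "data.frame": a one-column data frame

mean(toy[["PatientAge"]])    # works on the extracted values
toy[2, "PatientAge"]         # row 2 of that column: 78
```

Functions such as mean() expect the extracted values, which is why double brackets or the $ shortcut are the usual choice when passing a single column to a calculation.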

Factors

A factor is a type of array in which each category is assigned an integer, and only the integers are actually stored at each index. An example will be shown below.

Factors simplify the storing and manipulation of data where there is a limited number of categories. For example, there are only a limited number of values we expect for gender.

gender = mydata[["PatientSex"]]
summary(gender)

##    F    M    O
## 5560 5431    1

Those who are using RStudio will see in their “Environment” tab that the gender variable lists the levels and then displays the data as “2 1 1 2 1…” This is how R stores the actual data; the levels act as an index that associates 1 with F and 2 with M.
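The same stored codes can be inspected directly from the console. The sketch below builds a five-element made-up factor named sex (not the tutorial's gender variable) to show the levels, the integer codes, and how re-applying factor() drops a level that is no longer used.

```r
sex <- factor(c("M", "F", "F", "M", "O"))

levels(sex)       # "F" "M" "O": categories, sorted alphabetically
as.integer(sex)   # 2 1 1 2 3: the codes actually stored
table(sex)        # counts per category

# factor() applied again drops levels that no longer occur
levels(factor(sex[sex != "O"]))   # "F" "M"
```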

Specific data extraction

We can also use a built-in search function to find and then reference particular data of interest. We can see above that there is one study whose gender is not classified as M or F but is instead listed as ‘O’. We can find the index of that study within the gender data.

which(gender=='O',arr.ind=TRUE)

## [1] 1468


This output can be used as input to investigate why that patient does not have an assigned gender. Let’s check the location the patient came from.

location = mydata[["PatientLocationCode"]]
location[which(gender=='O',arr.ind=TRUE)]

## [1] TRAUMA
## 489 Levels: APUO CCU CCUCCU02 CCUCCU03 ... XIRO

As you can see, that patient was from TRAUMA and the gender information was likely unknown (at least to the system) at the time of the scan.

Logicals and Loops

We often need to perform a logical test to compute a statistic or as a decision point in a calculation. Logical variables can be either TRUE or FALSE. We can use these types of variables to perform a boolean test.

val = TRUE
if(val){message("That was TRUE")}

## That was TRUE

We can also use this type of test to perform different activities depending on the result.

weightKg = mydata$PatientWeight
weightLb = 2.2 * weightKg
if(max(weightLb) < 400){
  message("Normal mAs")
} else{
  message("We need more power!")
}

## We need more power!

We can combine this if statement with a loop to scan information. Let’s find the patient ages for the first 10 CT scans of the abdomen and pelvis.

age = mydata[["PatientAge"]]
site = mydata[["BodySite"]]
genCT = NA
num = 0
for (i in 1:length(site)){
  if (site[i] == "ABD-PELVIS") {
    num = num + 1
    genCT[num] = age[i]
  }
  if(num > 9){
    break
  }
}
genCT

## [1] 57 90 57 130 39 28 30 29 60 61
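For comparison, the same kind of selection can usually be written without an explicit loop by combining a logical index with head(). This is an alternative sketch rather than the approach used in the session, and age_toy, site_toy and first_two are made-up stand-ins for the real columns.

```r
# Made-up stand-ins for the age and site columns
age_toy  <- c(57, 78, 67, 90, 28, 57)
site_toy <- c("ABD-PELVIS", "HEAD", "HEAD", "ABD-PELVIS", "HEAD", "ABD-PELVIS")

# Logical indexing keeps the matching ages; head() keeps at most the first n
first_two <- head(age_toy[site_toy == "ABD-PELVIS"], 2)
first_two   # 57 90
```

Applied to the tutorial data, the equivalent expression would be head(age[site == "ABD-PELVIS"], 10).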

We can also combine if statements and loops to update information. For example, the patient ages were calculated at the time of the scan. What if we wanted to know the current ages of the patients?

age = mydata[["PatientAge"]]
birthData = mydata[["PatientsBirthDate"]]
studyDate = mydata[["StudyDate"]]

These results from the second patient show that the scan (and age calculation) was performed in 2012.

Patient’s birthday: 9/26/34
Study date: 20121111
Age: 78

Now let’s recalculate all of the patient ages to account for today’s date using a for loop. This, unfortunately, requires several conversions.

# Birth dates use two-digit years (e.g. 9/26/34), so %Y reads "34" as the year 34
birthDataCon = strptime(birthData,"%m/%d/%Y")
ageCon = 1:length(birthDataCon)
todayYear = as.numeric(strftime(Sys.Date(),format="%Y"))
todayDay = as.numeric(strftime(Sys.Date(),format="%j"))
for (i in 1:length(birthDataCon)){
  # Subtracting 1900 assumes every two-digit birth year falls in the 1900s
  ageCon[i] = todayYear - 1900 - as.numeric(strftime(birthDataCon[i],format="%Y")) - 1
  # %j is the day of the year; add the year back once the birthday has passed
  if (todayDay > as.numeric(strftime(birthDataCon[i],format="%j"))){
    ageCon[i] = ageCon[i] + 1
  }
}

The above code is complex, but we can break it down.

1. First, we have to convert the date notation in our data into something R can read.

2. Loop over all of the studies.

3. Get the year from the current date and the year from the birth date data. These are initially returned as text instead of numbers, so do another conversion before subtracting them.

4. Add 1 to the age if the current day of the year is after the patient’s birthday.

Now let’s check that we are correct.

Patient’s birthday: 9/26/34
Today’s date: 2016-06-01
Age: 81
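The same calculation can also be sketched with R's Date class and without a loop. This is an alternative under stated assumptions, not the method used in the session: it assumes every birthday is written as month/day/two-digit year, that no patient is 100 or older on the reference date, and it fixes the reference date at 2016-06-01 so the result is reproducible. The three birthdays after the first and all variable names are made up for illustration.

```r
birth_raw <- c("9/26/34", "1/15/80", "6/1/99")   # first value is from the example above
ref       <- as.Date("2016-06-01")               # fixed reference date

# %y reads a two-digit year; 00-68 become 20xx and 69-99 become 19xx
birth <- as.Date(birth_raw, format = "%m/%d/%y")

# Any birthday that lands in the future must belong to the previous century
lt <- as.POSIXlt(birth)
future <- birth > ref
lt$year[future] <- lt$year[future] - 100
birth <- as.Date(lt)

# Age = difference in years, minus 1 if the birthday has not yet occurred
b <- as.POSIXlt(birth)
r <- as.POSIXlt(ref)
had_birthday <- (r$mon > b$mon) | (r$mon == b$mon & r$mday >= b$mday)
age_now <- r$year - b$year - ifelse(had_birthday, 0, 1)
age_now   # 81 36 17
```

Comparing month and day directly, rather than the day of the year, avoids an off-by-one day around birthdays when one of the two years is a leap year.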

Plotting

R has multiple functions that make plotting very simple. For example, to see how weight and calculated dose co-vary on CTs of the abdomen and pelvis, we can create a scatter plot. First, we have to select only the examinations that are of the abdomen and pelvis.

weight = mydata[["PatientWeight"]]
dose = mydata[["EstimatedDose"]]
weight2 = weight[which(site=="ABD-PELVIS",arr.ind=TRUE)]
dose2 = dose[which(site=="ABD-PELVIS",arr.ind=TRUE)]
plot(weight2, dose2, xlab="Weight", ylab="Dose")

We can see an immediate problem. The large number of points along the weight = 0 line is due to missing weights (actually entered as 0.001 kg). We next have to filter those out, using a cutoff of 10 because some are entered as a few kg. Also, let’s add a trendline.

weight3 = weight2[which(weight2 > 10,arr.ind=TRUE)]
dose3 = dose2[which(weight2 > 10,arr.ind=TRUE)]
plot(weight3, dose3, xlab="Weight", ylab="Dose")
abline(lm(dose3 ~ weight3))
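The line drawn by abline() comes from the intercept and slope of the fitted linear model, which can be read with coef(). The sketch below uses six made-up weight and dose values (including two placeholder weights of 0.001) chosen so that the answer is easy to verify; none of these numbers come from the tutorial data.

```r
weight_toy <- c(0.001, 55, 70, 85, 100, 0.001)
dose_toy   <- c(9, 7.5, 9, 10.5, 12, 14)   # made-up values

# Same filter as above: drop the placeholder weights
keep <- weight_toy > 10
fit  <- lm(dose_toy[keep] ~ weight_toy[keep])

coef(fit)   # intercept about 2 and slope about 0.1 for these values
```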

We can use a histogram to see the weight distribution of patients getting a CT of the abdomen and pelvis. Let’s group them in 10 kg buckets. Note that we have a few outliers that we don’t really care about mapping individually, so after 200 just skip to 300.

hist(weight3, c(seq(10, 200, by=10),300), freq=FALSE, xlim=c(0,200))


Statistical calculations

Correlation

One of R’s strengths is the ease of performing statistical calculations without having to perform a significant amount of data manipulation. Using the examples we just graphed, let’s look at the correlation between weight and dose for those abdominal CT scans.

cor.test(weight3,dose3)

##
##  Pearson's product-moment correlation
##
## data:  weight3 and dose3
## t = 30.281, df = 1483, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5856627 0.6485915
## sample estimates:
##       cor
## 0.6181165

So there is definitely a correlation (the p-value is far below 0.05), though at r = 0.62 it is only of moderate strength.
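Two of the numbers in this output can be tied together by hand. Squaring the correlation gives the share of the variance in dose that is associated with weight, and the reported t statistic follows from the correlation and the degrees of freedom. The sketch below simply re-uses the values printed above; r, df and t_stat are names chosen here.

```r
r  <- 0.6181165   # correlation reported by cor.test()
df <- 1483        # degrees of freedom reported by cor.test()

r^2                                  # about 0.38
t_stat <- r * sqrt(df / (1 - r^2))   # t statistic for a Pearson correlation
t_stat                               # about 30.28
```

An r squared of about 0.38 means that weight accounts for a little over a third of the variation in dose, which is consistent with describing the relationship as moderate.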

Chi-square

A Chi-square calculation can be used to test the goodness of fit of observed counts against expected counts. Let’s do this for the gender distribution of the studies in our sample. First, we have to remove the “other” entry from the unknown TRAUMA study discussed previously. Then run the test and print the results.

genderMod = factor(gender[which(gender != 'O')])
summary(genderMod)

##    F    M
## 5560 5431

chisq.test(table(genderMod),p=c(0.5,0.5))

##
##  Chi-squared test for given probabilities
##
## data:  table(genderMod)
## X-squared = 1.5141, df = 1, p-value = 0.2185

Here we assume that the gender breakdown should be equal at 50% each. Although there is a slight difference in raw numbers, the p-value of 0.2185 is above the cutoff of alpha = 0.05, so the difference is well within the expected variability.
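The statistic itself is simple enough to reproduce by hand from the counts shown above: sum the squared differences between observed and expected counts, each divided by the expected count. The names observed, expected, stat and p are chosen here for illustration.

```r
observed <- c(F = 5560, M = 5431)
expected <- sum(observed) * c(0.5, 0.5)   # 5495.5 each under a 50/50 split

stat <- sum((observed - expected)^2 / expected)
p    <- pchisq(stat, df = 1, lower.tail = FALSE)

stat   # about 1.514
p      # about 0.2185
```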

ANOVA

Finally, once we have the data properly formatted, R allows an easy ANOVA calculation. Let’s compare the dose of a regular enhanced CT of the abdomen/pelvis to CT of the abdomen/pelvis with and without contrast. We also want to see if gender or weight makes a difference. First, a little data manipulation to extract and build a new data frame. Note that the factor command just tells R to remove unused categories, carried over from the original factor, without any members.

code = mydata[["CodeValue"]]
codeANOVA = factor(code[which((code == "CTAPE" | code == "CTAPC") & weight > 10)])
genderANOVA = factor(gender[which((code == "CTAPE" | code == "CTAPC") & weight > 10)])
doseANOVA = dose[which((code == "CTAPE" | code == "CTAPC") & weight > 10)]
weightANOVA = weight[which((code == "CTAPE" | code == "CTAPC") & weight > 10)]
dfANOVA = data.frame(codeANOVA, genderANOVA, weightANOVA, doseANOVA)
summary(dfANOVA)

##  codeANOVA   genderANOVA  weightANOVA       doseANOVA
##  CTAPC: 67   F:548        Min.   : 34.02    Min.   : 0.945
##  CTAPE:922   M:441        1st Qu.: 63.50    1st Qu.: 8.625
##                           Median : 77.11    Median :11.580
##                           Mean   : 80.15    Mean   :13.254
##                           3rd Qu.: 91.63    3rd Qu.:15.615
##                           Max.   :226.80    Max.   :73.650

Now, we can run the ANOVA itself. We have to tell R that we want to use a linear model using the lm() command. Within that command, we specify the dependent data first, then what factors act on it. We also say if we want to test for any interactions; the * operator in the formula includes both main effects and their interaction. Then, using the anova() function, we can get the results.

resultsANOVA = lm(doseANOVA ~ codeANOVA*weightANOVA, data = dfANOVA)
anova(resultsANOVA)

## Analysis of Variance Table
##
## Response: doseANOVA
##                        Df  Sum Sq Mean Sq F value    Pr(>F)
## codeANOVA               1  3066.4  3066.4 101.436 < 2.2e-16 ***
## weightANOVA             1 20280.9 20280.9 670.896 < 2.2e-16 ***
## codeANOVA:weightANOVA   1   489.8   489.8  16.201 6.132e-05 ***
## Residuals             985 29776.1    30.2
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we expect, there is a difference between a CT performed with and without contrast compared to one only with contrast. We also see a difference by weight. However, the interaction term of weight and code is also significant. This indicates that the effect of weight on dose is not the same for the two exam types: the slope of dose versus weight depends on the code, which the two main effects alone cannot capture.
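What an interaction between a factor and a continuous variable looks like in the fitted coefficients can be seen on a small artificial example. The data below are made up and noise-free, with exam type B given a steeper dose-per-weight slope than type A; the data frame toy and its columns are hypothetical and unrelated to the tutorial data.

```r
# Made-up, noise-free data: type A follows 1 + 0.1 * weight, type B follows 2 + 0.3 * weight
toy <- data.frame(
  weight = c(50, 70, 90, 110, 50, 70, 90, 110),
  code   = factor(rep(c("A", "B"), each = 4))
)
toy$dose <- ifelse(toy$code == "A", 1 + 0.1 * toy$weight, 2 + 0.3 * toy$weight)

fit <- lm(dose ~ code * weight, data = toy)
coef(fit)
# (Intercept) and weight describe type A (about 1 and 0.1);
# codeB and codeB:weight are the extra intercept and extra slope for type B (about 1 and 0.2)
```

A nonzero codeB:weight coefficient is exactly the "different slopes" situation that a significant interaction term points to.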

Loading packages

One of the major strengths of the R platform is the ability to easily install external packages to leverage functions already written by other users. R provides an easy interface to download and install the external packages from a central repository. For example, let’s install an extension to the graphing functions called ggplot2.

install.packages("ggplot2")

The library() function loads the installed package into the current session.

library(ggplot2)

In RStudio, we can see which packages are installed and loaded by going to the Packages tab and making sure the ggplot2 package is listed and checked.

Now, let’s see how ggplot2 adds both power and simplicity to creating graphs. Above, we created a histogram for weight3. With ggplot2, we can create this histogram, and variations on it, with much simpler functions.

When we use ggplot2, we give it the entire dataset and just indicate which variable (or subset) we are interested in using. For the simple histogram, use the subset of our whole dataset where the weight is above 10 kg.

g = ggplot(subset(mydata,PatientWeight > 10), aes(x = PatientWeight))
g + geom_histogram()

If we are interested in the density instead, we can just modify the last line.

g + geom_density()
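Layers and options are added in the same way. As a further sketch, the histogram can be given 10 kg buckets and axis labels with the binwidth argument of geom_histogram() and the labs() function. The data frame toy below holds made-up weights standing in for mydata, and the requireNamespace() check simply skips the plot if ggplot2 has not been installed.

```r
# Made-up weights; 0.001 mimics a missing value as in the tutorial data
toy <- data.frame(PatientWeight = c(0.001, 55, 62, 70, 78, 85, 91, 104, 120))
kept <- subset(toy, PatientWeight > 10)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  g <- ggplot(kept, aes(x = PatientWeight))
  # binwidth sets the bar width in data units (10 kg here); labs() sets the axis titles
  p <- g + geom_histogram(binwidth = 10) + labs(x = "Weight (kg)", y = "Count")
  print(p)
}
```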