lecture 19 classification analysis – r code mcb 416a/516a statistical bioinformatics and genomic...
TRANSCRIPT
![Page 1: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/1.jpg)
Lecture 19Classification analysis – R code
MCB 416A/516A Statistical Bioinformatics and Genomic Analysis
Prof. Lingling AnUniv of Arizona
![Page 2: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/2.jpg)
Last time:
Introduction to classification analysis Why gene select Performance assessment Case study
2
![Page 3: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/3.jpg)
ALL data preprocessed gene expression data Chiaretti et al., Blood (2004) 12625 genes (hgu95av2 Affymetrix GeneChip) 128 samples (arrays) phenotypic data on all 128 patients, including:
95 B-cell cancer 33 T-cell cancer
3
![Page 4: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/4.jpg)
Packages needed#### install the required packages for the first time ###
source("http://www.bioconductor.org/biocLite.R")biocLite("ALL")biocLite("Biobase")biocLite("genefilter")biocLite("hgu95av2.db")biocLite("MLInterfaces")install.packages("gplots")install.packages("e1071")
4
![Page 5: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/5.jpg)
Then …#### load the packages ###
library("ALL") ## or without quoteslibrary("Biobase")library("genefilter")library("hgu95av2.db")library("MLInterfaces")library("gplots")library("e1071")
5
![Page 6: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/6.jpg)
Read in data and other practicesdata() ## list all data sets available in loaded packages
data(package="ALL") ## list all data sets in the “ALL” package
data(ALL) ## manually load the specific data into R (R doesn’t automatically load everything to save memory)
?ALL ## description of the ALL dataset
ALL ## data set “ALL”; It is an instance of “ExpressionSet” class
help(ExpressionSet)exprs(ALL)[1:3,] ## get the first three rows of the data
matrix
6
![Page 7: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/7.jpg)
Data manipulationpData(ALL) ## info about the
data/experimentsnames(pData(ALL)) ## column names of the infoOr varLabels(ALL)
ALL.1<-ALL[,order(ALL$mol.bio)] ### order samples by sample types
ALL.1$mol.bio
7
![Page 8: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/8.jpg)
Effect on gene selection by correlation map#### plot the correlation matrix of the 128 samples using all 12625
genes #####
library(gplots)heatmap( cor(exprs(ALL.1)), Rowv=NA, Colv=NA,
scale="none", labRow=ALL.1$mol.bio, labCol= ALL.1$mol.bio, RowSideColors= as.character(as.numeric(ALL.1$mol.bio)), ColSideColors= as.character(as.numeric(ALL.1$mol.bio)), col=greenred(75))
8
![Page 9: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/9.jpg)
9
![Page 10: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/10.jpg)
Simple gene-filtering #### make a simple filtering and select genes with standard
deviation (i.e., sd) larger than 1 #####
ALL.sd<-apply(exprs(ALL.1), 1, sd)ALL.new<-ALL.1 [ALL.sd>1, ]ALL.new #### 379 genes left
10
![Page 11: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/11.jpg)
Heatmap of the selected genesheatmap( cor(exprs(ALL.new)), Rowv=NA, Colv=NA,
scale="none", labRow=ALL.new$mol.bio, labCol= ALL.new$mol.bio, RowSideColors= as.character(as.numeric(ALL.new$mol.bio)), ColSideColors= as.character(as.numeric(ALL.new$mol.bio)),
col=greenred(75))
ALL.new$BT ## In the correlation plot, NEG has two subgroups. One is B-cell (BT=B) and the other is T-cell (BT=T).
## BT: The type and stage of the disease; B indicates B-cell ALL while a T indicates T-cell ALL
11
![Page 12: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/12.jpg)
12
![Page 13: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/13.jpg)
Exploration/illustration of why we need do gene selection is done.
Let’s go back and focus on classification analysis
13
![Page 14: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/14.jpg)
Classification analysisPreprocessing and selection of samples
# back to the ALL data and select all B-cell lymphomas from the BT column:> table(ALL$BT) B B1 B2 B3 B4 T T1 T2 T3 T4 5 19 36 23 12 5 1 15 10 2>bcell <- grep("^B", as.character(ALL$BT))
> table(ALL$mol.biol) ## take a look at the summary of mol.biol ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16 10 37 5 74 1 1
# select BCR/ABL abnormality and negative controls from the mol.biol column:
moltype <- which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL"))
# only keep patients with B-cell lymphomas in BCR/ABL or NEG
ALL_bcrneg <- ALL[, intersect(bcell, moltype)]ALL_bcrneg$mol.biol <- factor(ALL_bcrneg$mol.biol) ## drop unused
levels 14
![Page 15: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/15.jpg)
Select the samples # structure of the new dataset
> head(pData(ALL_bcrneg))> table(ALL_bcrneg$mol.biol)
BCR/ABL NEG 37 42
15
![Page 16: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/16.jpg)
Gene filteringlibrary("hgu95av2.db")
# Apply non-specific filtering
library(genefilter)small <- nsFilter(ALL_bcrneg, var.cutoff=0.75)$esetsmall ## find out how many genes are left?dim(small) ## same thing
16
![Page 17: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/17.jpg)
> table(small$mol.biol)
BCR/ABL NEG 37 42
#... define positive and negative cases
> Negs <- which(small$mol.biol == "NEG")> Bcr <- which(small$mol.biol == "BCR/ABL")
> Negs [1] 2 4 5 6 7 8 11 12 14 19 22 24 26 28 31 35 37 38 39[20] 43 44 45 46 49 50 51 52 54 55 56 57 58 61 62 65 66 67 68[39] 70 74 75 77
> Bcr [1] 1 3 9 10 13 15 16 17 18 20 21 23 25 27 29 30 32 33 34 36 40 41[23] 42 47 48 53 59 60 63 64 69 71 72 73 76 78 79
17
![Page 18: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/18.jpg)
Split data into training and validation set#... randomly sample 20 individuals from each group for training and
put them together as training data
set.seed(7)S1 <- sample(Negs, 20, replace=FALSE)S2 <- sample(Bcr, 20, replace=FALSE)TrainInd <- c(S1, S2)
#... the remainder is for validation
> TestInd <- setdiff(1:79, TrainInd)> TestInd [1] 1 3 4 6 10 11 14 15 17 20 22 23 24 25 27 28 32 33 38 39 40 41 43 44
45 46 50 51 52[30] 53 54 58 61 64 68 71 74 75 79
18
![Page 19: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/19.jpg)
Gene selection further#... perform gene/feature selection on the TRAINING SET
Train=small[,TrainInd]Traintt <- rowttests(Train, "mol.biol")dim(Traintt) ## check the size>head(Traintt) ## and take a look at the top part of the datastatistic dm p.value41654_at 0.9975424 0.22064713 0.324811435430_at 0.4318284 0.10770465 0.668306638924_s_at -0.3245558 -0.06028665 0.747297236023_at 1.1890007 0.19535279 0.2418165266_s_at 0.3070361 0.12969821 0.760492237569_at 1.1079437 0.26620492 0.2748499
19
![Page 20: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/20.jpg)
#... order the genes by abs. value of test statistic
> ordTT <- order(abs(Traintt$statistic), decreasing=TRUE)> head(ordTT)[1] 677 1795 680 1369 1601 684
20
![Page 21: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/21.jpg)
#... select top 50 significant genes/features for simplicity
>featureNamesFromtt <- featureNames(small[ordTT[1:50],])>featureNamesFromtt [1] "1635_at" "37027_at" "40480_s_at" "36795_at" "41071_at" "1249_at"
[7] "38385_at" "33232_at" "41038_at" "32562_at" "37398_at" "37762_at"
[13] "37363_at" "40504_at" "41478_at" "35831_at" "34798_at" "1140_at"
[19] "37021_at" "41193_at" "32434_at" "40167_s_at" "39837_s_at" "35940_at"
[25] "40202_at" "35162_s_at" "33819_at" "39329_at" "39319_at" "31886_at"
[31] "36131_at" "39372_at" "41015_at" "36638_at" "41742_s_at" "38408_at"
[37] "40074_at" "37377_i_at" "37598_at" "37033_s_at" "40196_at" "760_at"
[43] "31898_at" "40051_at" "36591_at" "36008_at" "33206_at" "35763_at"
[49] "32612_at" "33362_at"
#... create a reduced expressionSet with top 50 significant features selected on the TRAINING SET
> esetShort <- small[featureNamesFromtt,]> dim(esetShort)Features Samples 50 79
21
![Page 22: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/22.jpg)
So far we have selected the samples (79) with B-cell type cancer and mol.bio = BCR/AML or NEG
And selected/filtered genes twice: first ,filtering out the genes with small variation across 79 samples Second, selecting the top 50 genes (via two sample t-test) from a
randomly selected training data set (20 NEGs + 20 BCR/AML s)
Data set: esetShort
Now we’ll use these 50 genes as marker genes and build classifiers on training data set, and calculate the classification error rate (i.e., misclassification rate) on test data set.
22
![Page 23: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/23.jpg)
Classification methodslibrary(MLInterfaces)library(help=MLInterfaces)vignette("MLInterfaces")?MLearn
23
![Page 24: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/24.jpg)
Use validation set to estimate predictive accuracy
# -----------K nearest neighbors---------
Knn.out <- MLearn(mol.biol ~ ., data=esetShort, knnI(k=1, l=0), TrainInd)
show(Knn.out)confuMat(Knn.out)
bb=confuMat(Knn.out)Err=(bb[1,2]+bb[2,1])/sum(bb) ## calculate
misclassification rate
24
![Page 25: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/25.jpg)
# ----Linear discriminant analysis-------------
Lda.out <- MLearn(mol.biol ~ ., data=esetShort, ldaI, TrainInd)show(Lda.out)confuMat(Lda.out)
bb=confuMat(Lda.out)Err=(bb[1,2]+bb[2,1])/sum(bb) ## calculate misclassification rate
25
![Page 26: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/26.jpg)
# -----------Support vector machine -------------
library(e1071)Svm.out <- MLearn(mol.biol ~ ., data=esetShort, svmI, TrainInd)show(Svm.out)
bb=confuMat(Svm.out)Err=(bb[1,2]+bb[2,1])/sum(bb) ## calculate misclassification rate
26
![Page 27: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/27.jpg)
# -----------Classification tree------------------
Ct.out <- MLearn(mol.biol ~ ., data=esetShort, rpartI, TrainInd)
show(Ct.out)confuMat(Ct.out)
27
![Page 28: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/28.jpg)
# -----------Logistic regression------------------
# it may be worthwhile using a more specialized implementation of logistic regression
Lr.out <- MLearn(mol.biol ~ ., data=esetShort, glmI.logistic(threshold=.5), TrainInd, family=binomial)
show(Lr.out)confuMat(Lr.out)
28
![Page 29: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/29.jpg)
Cross validationDefine function for random partition:
ranpart = function(K, data) { N = nrow(data) cu = as.numeric(cut(1:N, K)) sample(cu, size = N, replace = FALSE) }
ranPartition = function(K) function(data, clab, iternum) { p = ranpart(K, data) which(p == iternum) }
29
![Page 30: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/30.jpg)
random 5-fold cross-validationset.seed(1)kk=100error=rep(0, kk)
for (i in 1:kk){ r1 <- MLearn(mol.biol ~ ., esetShort, knnI(k = 1, l = 0), xvalSpec("LOG",
5, partitionFunc = ranPartition(5)))bb= confuMat(r1)error[i]=(bb[1,2]+bb[2,1])/sum(bb)
}mean(error) ## get the average of the error/misspecification from
cross-validationboxplot(error)
30
![Page 31: Lecture 19 Classification analysis – R code MCB 416A/516A Statistical Bioinformatics and Genomic Analysis Prof. Lingling An Univ of Arizona](https://reader034.vdocument.in/reader034/viewer/2022050714/56649e4c5503460f94b42495/html5/thumbnails/31.jpg)
“MLInterfaces” packageThis package is meant to be a unifying platform for all machine learning procedures (including classification and clustering methods). Useful but use of the package easily becomes a black box!!
Separated functions:Linear and quadratic discriminant analysis: “lda” and “qda”KNN classification: “knn”CART: “rpart”Bagging and AdaBoosting: “bagging” and “logitboost”Random forest: “randomForest”Support Vector machines: “svm”Artifical neural network: “nnet”Nearest shrunken centroids: “pamr”
Classification methods available in Bioconductor
31