BIO503: Lecture 5
Harvard School of Public Health Wintersession 2009
Jess Mar
Department of Biostatistics
Roadmap for Today
Some More Advanced Statistical Models
Multiple Linear Regression
Generalized Linear Models
– Logistic Regression
– Poisson Regression
– Survival Analysis
Multivariate Data Analysis
Programming Tutorials
Bits & Pieces
Tutorial 4
Multiple Linear Regression
Some handy functions to know about:
new.model <- update(old.model, new.formula)
Model selection functions (dropterm, addterm and stepAIC come from the MASS package; drop1, add1 and step are their counterparts in the stats package):
drop1, dropterm
add1, addterm
step, stepAIC
Similarly,
anova(modObj, test="Chisq")
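As a minimal sketch of how these functions fit together, the following uses R's built-in mtcars data. The data set, the model formula and the object names are assumptions made for illustration; they are not part of the lecture.

```r
library(MASS)  # provides dropterm, addterm, stepAIC

# Illustrative full model on the built-in mtcars data (not lecture data)
full.model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# update() refits an existing model with a modified formula: here, drop qsec
reduced.model <- update(full.model, . ~ . - qsec)

# Single-term deletions from the full model, with F tests
drop1(full.model, test = "F")

# Single-term additions to the reduced model, up to the terms in the full model
add1(reduced.model, scope = formula(full.model), test = "F")

# Stepwise selection by AIC, starting from the full model
selected.model <- stepAIC(full.model, trace = FALSE)

# Compare the two nested fits
anova(reduced.model, full.model)
```

For models fitted with glm, the same comparison would use anova(..., test="Chisq") in place of the default F test.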
Generalized Linear Models
Linear regression models hinge on the assumption that the response variable, given the covariates, follows a Normal distribution.
Generalized linear models handle non-Normal response variables by relating the mean response to a linear predictor through a link function (a transformation to linearity).
Logistic Regression
When faced with a binary response Y ∈ {0, 1}, we use logistic regression.
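$$P(Y_i = 1 \mid x_i, \beta) = \pi_i, \qquad x_i^T = (x_{i1}, \dots, x_{ip}), \qquad \beta^T = (\beta_1, \dots, \beta_p)$$
where
$$\log\frac{\pi_i}{1 - \pi_i} = \log\frac{P(Y_i = 1 \mid x_i, \beta)}{P(Y_i = 0 \mid x_i, \beta)} = x_i^T \beta = \sum_j \beta_j x_{ij}$$
$$\pi_i = \frac{\exp\left(\sum_j \beta_j x_{ij}\right)}{1 + \exp\left(\sum_j \beta_j x_{ij}\right)}$$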
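The relationship between the linear predictor and the probability can be checked numerically. This is a small sketch only: the coefficient vector beta and the design matrix X are hypothetical values chosen for illustration.

```r
# Hypothetical coefficients and covariate values, for illustration only
beta <- c(-1.0, 0.5)                # (intercept, slope)
X <- cbind(1, c(0, 1, 2, 4))        # design matrix with an intercept column

eta <- drop(X %*% beta)             # linear predictor x_i^T beta
p <- exp(eta) / (1 + exp(eta))      # inverse logit gives P(Y_i = 1 | x_i, beta)
logit.p <- log(p / (1 - p))         # the logit recovers the linear predictor

all.equal(logit.p, eta)             # the two transformations are inverses
all.equal(p, plogis(eta))           # plogis() is R's built-in inverse logit
```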
Problem 2 – Logistic Regression
Read in the anaesthetic data set, data file: anaesthetic.txt.
Covariates:
move binary numeric vector for patient movement
(1 = movement, 0 = no movement)
conc anaesthetic concentration
nomove complement of move (1 = no movement, 0 = movement); used as the response in the fit
Goal: estimate how the probability of movement varies with increasing concentration of the anaesthetic agent.
Fit the Logistic Regression Model
> anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)
The output summary looks like this:
> summary(anes.logit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **
Estimates of P(Y=1) are given by:
> fitted.values(anes.logit)
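Estimating the Log Odds
To get back the fitted log odds (the linear predictor); the coefficient of conc is the log odds ratio for a one-unit increase in concentration: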
> anes.logit$linear.predictors
> plot(anesthetic$conc, anes.logit$linear.predictors)
> abline(coefficients(anes.logit))
Looks like the odds of not moving increase significantly when you increase the concentration of the anesthetic agent beyond 0.8.
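The same steps can be tried without the data file. The sketch below simulates a data set with the same two variable names; the simulated values, the true coefficients (-6 and 5) and the object names are assumptions for illustration and are not the lecture data.

```r
set.seed(503)
n <- 200
conc <- runif(n, 0.8, 2.5)                      # simulated concentrations
nomove <- rbinom(n, 1, plogis(-6 + 5 * conc))   # simulated binary response
sim <- data.frame(conc, nomove)

sim.logit <- glm(nomove ~ conc, family = binomial(link = "logit"), data = sim)

# Fitted log odds: the same values as sim.logit$linear.predictors
log.odds <- predict(sim.logit, type = "link")

# Fitted probabilities: the same values as fitted.values(sim.logit)
prob <- predict(sim.logit, type = "response")

# exp(coefficient) is the odds ratio for a one-unit increase in conc
odds.ratio <- exp(coef(sim.logit)["conc"])

# Predicted probability of not moving at new concentrations
predict(sim.logit, newdata = data.frame(conc = c(0.8, 1.2, 1.6)),
        type = "response")
```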
Problem 3 – Multiple Logistic Regression
Read in data set birthwt.txt.
low indicator of birth weight less than 2.5kg
age mother's age in years
lwt mother's weight in pounds at last menstrual period
race mother's race (1 = white, 2 = black, 3 = other)
smoke smoking status during pregnancy
ptl number of previous premature labours
ht history of hypertension
ui presence of uterine irritability
ftv number of physician visits during the first trimester
bwt birth weight in grams
We fit a logistic regression using the glm function and using the binomial family.
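A possible fit is sketched below. It assumes that the birthwt data frame shipped with the MASS package, which has the same variable names, can stand in for birthwt.txt; the choice of covariates is illustrative rather than the lecture's solution.

```r
library(MASS)  # assumption: MASS::birthwt stands in for birthwt.txt

bw <- birthwt
# Treat race as a categorical variable rather than a numeric code
bw$race <- factor(bw$race, labels = c("white", "black", "other"))

# bwt is left out of the predictors because low is defined from it
bw.logit <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                family = binomial, data = bw)
summary(bw.logit)

# Coefficients on the odds-ratio scale
exp(coef(bw.logit))
```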
Problem 4 - Poisson Regression
Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease.
Example: schooldata.csv.
We can fit the Poisson regression model using the glm function and the poisson family.
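Since the contents of schooldata.csv are not shown here, the sketch below uses simulated counts instead. The variable names, the exposure variable and the true coefficients are all hypothetical.

```r
set.seed(503)
n <- 300
exposure <- runif(n, 50, 500)                     # hypothetical person-time at risk
x <- rnorm(n)                                     # hypothetical covariate
events <- rpois(n, exposure * exp(-3 + 0.4 * x))  # simulated event counts
sim <- data.frame(events, x, exposure)

# offset(log(exposure)) turns a model for counts into a model for rates
pois.fit <- glm(events ~ x + offset(log(exposure)),
                family = poisson, data = sim)
summary(pois.fit)

# exp(coefficient) is the rate ratio for a one-unit increase in x
exp(coef(pois.fit)["x"])
```

Without an exposure variable, the offset term is simply dropped and the model describes the expected count directly.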
Survival Analysis
library(survival)
Example: aml leukemia data
Kaplan-Meier curve
fit1 <- survfit(Surv(time, status) ~ 1, data=aml[1:11, ])  # survfit needs a formula; ~ 1 gives a single curve
summary(fit1)
plot(fit1)
Log-rank test
survdiff(Surv(time, status)~x, data=aml)
Fit a Cox proportional hazards model
coxfit1 <- coxph(Surv(time, status)~x, data=aml)
summary(coxfit1)
Cumulative baseline hazard estimator:
basehaz(coxph(Surv(time, status)~x, data=aml))
Survival function for one group:
plot(survfit(coxfit1, newdata=data.frame(x=factor("Maintained", levels=levels(aml$x)))))  # x is a factor in aml
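The pieces above can be combined into one short workflow on the aml data. This is a sketch; the object names are chosen here for illustration.

```r
library(survival)

# Kaplan-Meier curves for each maintenance group
km.by.group <- survfit(Surv(time, status) ~ x, data = aml)
print(km.by.group)

# Log-rank test for a difference between the two groups
logrank <- survdiff(Surv(time, status) ~ x, data = aml)
print(logrank)

# Cox proportional hazards model; exp(coef) is the hazard ratio
cox.fit <- coxph(Surv(time, status) ~ x, data = aml)
hazard.ratio <- exp(coef(cox.fit))

# Model-based survival curve for one group: x must be supplied as a
# factor with the same levels as in the fitted data
new.patient <- data.frame(x = factor("Maintained", levels = levels(aml$x)))
maintained.curve <- survfit(cox.fit, newdata = new.patient)
summary(maintained.curve)
```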
Tutorial 5
Cluster Analysis
Hierarchical Methods:
(Agglomerative, Divisive) + (Single, Average, Complete) Linkage…
Model-based Methods:
Mixed models. Plaid models. Mixture models…
A clustering problem is generally much harder than a classification problem because we don’t know the number of classes.
Clustering observations on the basis of experiments or across a time series.
Clustering experiments together on the basis of observations.
[Figure: the data matrix, with rows for Genes and columns for Experiments or Microarray Slides]
$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1 N_E} \\
x_{21} & x_{22} & \cdots & x_{2 N_E} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N_G 1} & x_{N_G 2} & \cdots & x_{N_G N_E}
\end{pmatrix}
$$
Examples of Clustering Algorithms Available in R
Hierarchical Methods:
hclust
agnes
Partitioning Methods:
som
kmeans
pam
Packages:
cluster
[Figure: data matrix with axes labeled "Different Samples" and "Observations"]
Hierarchical Clustering
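Agglomerative: start with n genes in n clusters and merge until there are n genes in 1 cluster.
Divisive: start with n genes in 1 cluster and split until there are n genes in n clusters.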
We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’.
Euclidean distance
(Pearson) correlation
Source: J-Express Manual
Different Ways to Determine Distances Between Clusters
Single linkage
Complete linkage
Average linkage
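A minimal sketch of hierarchical clustering with the distance measures and linkage rules listed above. The expression matrix is simulated random noise, so the resulting clusters carry no biological meaning; the object names are illustrative.

```r
set.seed(503)
# Simulated expression matrix: 30 genes (rows) by 8 samples (columns)
expr <- matrix(rnorm(30 * 8), nrow = 30,
               dimnames = list(paste0("gene", 1:30), paste0("sample", 1:8)))

d.euclid <- dist(expr)                  # Euclidean distance between genes
d.cor <- as.dist(1 - cor(t(expr)))      # Pearson correlation distance between genes

# Agglomerative clustering under the three linkage rules
hc.single <- hclust(d.euclid, method = "single")
hc.complete <- hclust(d.euclid, method = "complete")
hc.average <- hclust(d.cor, method = "average")

# Cut the complete-linkage tree into 3 clusters
gene.clusters <- cutree(hc.complete, k = 3)
table(gene.clusters)

# Cluster the samples instead of the genes by transposing the matrix
hc.samples <- hclust(dist(t(expr)), method = "average")
```

Calling plot() on any of the hclust objects draws the corresponding dendrogram.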
Partitioning Methods
Examples of partitioning methods are k-means and partitioning around medoids (pam).
Gap statistic:
source("http://www.bioconductor.org/biocLite.R")
biocLite("SAGx")
?gap
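The goal is to choose the number of clusters k that maximizes the gap statistic, which compares the observed within-cluster dispersion with its expectation under a reference distribution with no clusters.
W – within-cluster variance
B – between-cluster variance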
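The sketch below runs k-means and pam on simulated two-group data and estimates the number of clusters with a gap statistic. It uses clusGap and maxSE from the cluster package as a stand-in for the SAGx gap function mentioned above, which is an assumption of this example; the data and object names are also illustrative.

```r
library(cluster)  # pam, clusGap, maxSE

set.seed(503)
# Simulated data: two well-separated groups of 50 points in 2 dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)
pm <- pam(x, k = 2)

# Within (W) and between (B) sums of squares add up to the total
km$tot.withinss + km$betweenss
km$totss

# Gap statistic for k = 1, ..., 5 using k-means as the clustering function
gap <- clusGap(x, FUNcluster = kmeans, K.max = 5, B = 50,
               nstart = 10, verbose = FALSE)

# Choose k with the default rule of maxSE (first local maximum, allowing
# for one standard error)
k.hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])
k.hat
```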
K-means Clustering
Reference: J-Express manual
241 genes from 19 cell samples into 6 clusters.
Classification (Machine Learning)
Machine learning algorithms predict new classes based on patterns discerned from existing data.
Classification algorithms are a form of supervised learning.
Clustering algorithms are a form of unsupervised learning.
R packages:
class – contains knn, SOM
nnet
MLInterfaces – Bioconductor. A simplified way to construct machine learning algorithms from microarray data.
Goal: derive a rule (classifier) that assigns a new object (e.g. patient microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).
Classification
Linear Discriminant Analysis: lda
Support Vector Machines: library(e1071), svm
K-nearest neighbors: knn
Tree-based methods: rpart, randomForest
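As a small illustration of supervised learning with two of these functions, the sketch below trains on part of the built-in iris data and predicts the rest. The data set, the 100/50 split and k = 5 are assumptions chosen for the example, not recommendations from the lecture.

```r
library(MASS)   # lda
library(class)  # knn

set.seed(503)
train.idx <- sample(nrow(iris), 100)
train.set <- iris[train.idx, ]
test.set <- iris[-train.idx, ]

# Linear discriminant analysis: fit on the training set, predict the test set
lda.fit <- lda(Species ~ ., data = train.set)
lda.pred <- predict(lda.fit, newdata = test.set)$class

# k-nearest neighbors: classify each test row by its 5 nearest training rows
knn.pred <- knn(train = train.set[, 1:4], test = test.set[, 1:4],
                cl = train.set$Species, k = 5)

# Proportion of test observations classified correctly
lda.accuracy <- mean(lda.pred == test.set$Species)
knn.accuracy <- mean(knn.pred == test.set$Species)

# Confusion matrix for the LDA predictions
table(predicted = lda.pred, actual = test.set$Species)
```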
Scaling Methods
Principal Component Analysis
prcomp
Multi-dimensional Scaling
MDS (e.g. cmdscale for classical MDS)
Self Organizing Maps
SOM
Independent Component Analysis
fastICA
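A brief sketch of the first two methods on the numeric columns of the built-in iris data, which is an assumption of this example. It uses cmdscale from the stats package for classical multi-dimensional scaling.

```r
x <- as.matrix(iris[, 1:4])

# Principal component analysis on the centered, unscaled data
pca <- prcomp(x, center = TRUE, scale. = FALSE)
summary(pca)             # proportion of variance explained by each component
head(pca$x[, 1:2])       # scores on the first two components

# Classical multi-dimensional scaling on Euclidean distances
mds <- cmdscale(dist(x), k = 2)

# With Euclidean distances, classical MDS coordinates agree with the
# PCA scores up to the sign of each axis
all.equal(abs(unname(mds[, 1])), abs(unname(pca$x[, 1])))
```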
R Shortcuts
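Ctrl + A: move the cursor to the start of the line
Ctrl + E: move the cursor to the end of the line
Ctrl + K: delete from the cursor to the end of the line
Esc: interrupt the current command (R GUI)
{Up, Down} Arrow: step through the command history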
Laundry List
.Rprofile file
Outline of R packages
Graphics – lattice, Rwiki
Homework
R/SAS/Stata Comparison
Exercises