Post on 20-Dec-2015
09/05/2005 Seminar in Mathematical Biology
Dimension Reduction - PCA
Principal Component Analysis
The Goals
Reduce the number of dimensions of a data set.
Capture the maximum information present in the initial data set.
Minimize the error between the original data set and the reduced-dimensional data set.
Simpler visualization of complex data.
The Algorithm
Step 1: Calculate the covariance matrix of the observation matrix.
Step 2: Calculate the eigenvalues and the corresponding eigenvectors.
Step 3: Sort eigenvectors by the magnitude of their eigenvalues.
Step 4: Project the data points on those vectors.
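The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the seminar's own code: the function name `pca`, the random demo data and the choice of `k` are assumptions, and the data matrix is taken to hold variables in rows and observations in columns with a 1/N normalisation.

```python
import numpy as np

def pca(X, k):
    """Project X (variables x observations) onto its top-k principal components."""
    # Step 1: covariance matrix of the observation matrix (1/N normalisation).
    X_centered = X - X.mean(axis=1, keepdims=True)
    C = X_centered @ X_centered.T / X.shape[1]
    # Step 2: eigenvalues and eigenvectors (eigh is intended for symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # Step 3: sort the eigenvectors by decreasing eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 4: project the centred data points onto the first k eigenvectors.
    scores = eigenvectors[:, :k].T @ X_centered
    return eigenvalues, eigenvectors, scores

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3, 50))   # hypothetical data: 3 variables, 50 observations
eigenvalues, eigenvectors, scores = pca(X_demo, k=2)
print(scores.shape)
```

Using `np.linalg.eigh` rather than `np.linalg.eig` is a design choice: the covariance matrix is symmetric, so the eigenvalues are guaranteed to be real.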
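PCA – Step 1: Covariance Matrix C
X - Data Matrix, with the observations $X_1, \ldots, X_N$ as its columns
\[ C = \frac{1}{N} \sum_{n=1}^{N} (X_n - \bar{X})(X_n - \bar{X})^T \]
\[ C(X) = \begin{pmatrix} Var(X_1) & \cdots & Cov(X_1, X_n) \\ \vdots & \ddots & \vdots \\ Cov(X_n, X_1) & \cdots & Var(X_n) \end{pmatrix} \]
In the matrix form, the indices 1, ..., n run over the variables (the rows of X).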
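Covariance Matrix - Example
\[ C = \frac{1}{N} \sum_{n=1}^{N} (X_n - \bar{X})(X_n - \bar{X})^T \]
\[ X = \begin{pmatrix} 1 & 4 & 7 & 8 \\ 2 & 2 & 8 & 4 \\ 1 & 13 & 1 & 5 \end{pmatrix} \]
\[ X_1 = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix}, \quad X_2 = \begin{pmatrix} 4 \\ 2 \\ 13 \end{pmatrix}, \quad X_3 = \begin{pmatrix} 7 \\ 8 \\ 1 \end{pmatrix}, \quad X_4 = \begin{pmatrix} 8 \\ 4 \\ 5 \end{pmatrix} \]
\[ \bar{X} = \frac{1}{4} \left[ \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} + \begin{pmatrix} 4 \\ 2 \\ 13 \end{pmatrix} + \begin{pmatrix} 7 \\ 8 \\ 1 \end{pmatrix} + \begin{pmatrix} 8 \\ 4 \\ 5 \end{pmatrix} \right] = \frac{1}{4} \begin{pmatrix} 20 \\ 16 \\ 20 \end{pmatrix} = \begin{pmatrix} 5 \\ 4 \\ 5 \end{pmatrix} \]
\[ \hat{X}_1 = X_1 - \bar{X} = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} - \begin{pmatrix} 5 \\ 4 \\ 5 \end{pmatrix} = \begin{pmatrix} -4 \\ -2 \\ -4 \end{pmatrix} \]
\[ \hat{X}_2 = \begin{pmatrix} -1 \\ -2 \\ 8 \end{pmatrix}, \quad \hat{X}_3 = \begin{pmatrix} 2 \\ 4 \\ -4 \end{pmatrix}, \quad \hat{X}_4 = \begin{pmatrix} 3 \\ 0 \\ 0 \end{pmatrix} \]
\[ C = \frac{1}{4} \begin{pmatrix} -4 & -1 & 2 & 3 \\ -2 & -2 & 4 & 0 \\ -4 & 8 & -4 & 0 \end{pmatrix} \begin{pmatrix} -4 & -2 & -4 \\ -1 & -2 & 8 \\ 2 & 4 & -4 \\ 3 & 0 & 0 \end{pmatrix} = \frac{1}{4} \begin{pmatrix} 30 & 18 & 0 \\ 18 & 24 & -24 \\ 0 & -24 & 96 \end{pmatrix} = \begin{pmatrix} 7.5 & 4.5 & 0 \\ 4.5 & 6 & -6 \\ 0 & -6 & 24 \end{pmatrix} \]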
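The arithmetic of this example can be checked numerically. The sketch below assumes NumPy and takes the 3 x 4 data matrix from the slide, with observations as columns and the 1/N normalisation used in the example.

```python
import numpy as np

# Data matrix from the example: 3 variables (rows), 4 observations (columns).
X = np.array([[1, 4, 7, 8],
              [2, 2, 8, 4],
              [1, 13, 1, 5]], dtype=float)

x_mean = X.mean(axis=1, keepdims=True)   # mean observation
X_hat = X - x_mean                       # centred observations
C = X_hat @ X_hat.T / X.shape[1]         # (1/N) * sum of outer products
print(x_mean.ravel())
print(C)
```

`np.cov(X, bias=True)` uses the same 1/N normalisation and can serve as a cross-check; the NumPy default (`bias=False`) divides by N-1 instead and would give a different matrix.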
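Linear Algebra Review – Eigenvalue and Eigenvector
C - a square $n \times n$ matrix
\[ Cv = \lambda v \]
\[ Cv - \lambda I v = 0 \]
\[ (C - \lambda I) v = 0 \]
\[ \det(C - \lambda I) = 0 \]
In $Cv = \lambda v$, $v$ is an eigenvector of C and $\lambda$ is the corresponding eigenvalue.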
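As a small numerical illustration of the eigenvalue relation, the sketch below (assuming NumPy) computes the eigenpairs of the 3 x 3 covariance matrix from the earlier covariance example and checks that each pair satisfies Cv = λv.

```python
import numpy as np

# Covariance matrix from the earlier covariance example.
C = np.array([[7.5, 4.5, 0.0],
              [4.5, 6.0, -6.0],
              [0.0, -6.0, 24.0]])

# eigh returns ascending eigenvalues; the columns of `eigenvectors` are the v's.
eigenvalues, eigenvectors = np.linalg.eigh(C)

for lam, v in zip(eigenvalues, eigenvectors.T):
    residual = np.linalg.norm(C @ v - lam * v)   # should be close to zero
    print(f"lambda = {lam:8.4f}   |Cv - lambda*v| = {residual:.2e}")
```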
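Singular Value Decomposition
\[ C_{n \times n} = U_{n \times n} \, \Sigma_{n \times n} \, V_{n \times n}^T, \qquad U = (u_{i,j}), \quad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n), \quad V = (v_{i,j}) \]
\[ C = U \Sigma V^T \]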
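SVD Example
Let us find the SVD of the matrix
\[ X = \begin{pmatrix} 2 & 2 \\ -1 & 1 \end{pmatrix} \]
1) First, compute $X^T X$:
\[ C = X^T X = \begin{pmatrix} 2 & -1 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 2 & 2 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} 5 & 3 \\ 3 & 5 \end{pmatrix} \]
2) Second, find the eigenvalues of $X^T X$ and the corresponding eigenvectors (use the following formula: $(\lambda I - C) x = 0$):
\[ D = \lambda I - C = \lambda \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \begin{pmatrix} 5 & 3 \\ 3 & 5 \end{pmatrix} = \begin{pmatrix} \lambda - 5 & -3 \\ -3 & \lambda - 5 \end{pmatrix} \]
\[ \det(D) = (\lambda - 5)(\lambda - 5) - (-3)(-3) = \lambda^2 - 10\lambda + 25 - 9 = \lambda^2 - 10\lambda + 16 = 0 \]
\[ \lambda_1 = 2, \quad \lambda_2 = 8 \]
\[ V_1 = \begin{pmatrix} -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}, \quad V_2 = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix} \]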
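SVD Example - Continued
3) Now, we obtain U and Σ:
\[ \sigma_1 u_1 = X v_1 = \begin{pmatrix} 2 & 2 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 0 \\ \sqrt{2} \end{pmatrix}; \quad \sigma_1 = \sqrt{2}, \quad u_1 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \]
\[ \sigma_2 u_2 = X v_2 = \begin{pmatrix} 2 & 2 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 2\sqrt{2} \\ 0 \end{pmatrix}; \quad \sigma_2 = 2\sqrt{2}, \quad u_2 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \]
4) And the decomposition $X = U \Sigma V^T$:
\[ \begin{pmatrix} 2 & 2 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} \sqrt{2} & 0 \\ 0 & 2\sqrt{2} \end{pmatrix} \begin{pmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \]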
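The worked SVD can be cross-checked with NumPy. The sketch below assumes the example matrix is X = [[2, 2], [-1, 1]], a sign pattern consistent with the product X^T X = [[5, 3], [3, 5]] computed in step 1. Note that `np.linalg.svd` returns the singular values in decreasing order, and that the signs of the singular vectors are not unique, so individual columns may differ from a hand calculation by a factor of -1.

```python
import numpy as np

# Assumed example matrix (signs chosen to match X^T X = [[5, 3], [3, 5]]).
X = np.array([[2.0, 2.0],
              [-1.0, 1.0]])

U, s, Vt = np.linalg.svd(X)
print("singular values:", s)                       # 2*sqrt(2) and sqrt(2)
print("eigenvalues of X^T X:", np.linalg.eigvalsh(X.T @ X))
print("reconstruction:\n", U @ np.diag(s) @ Vt)    # should reproduce X
```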
PCA – Step 3
Sort the eigenvectors by the magnitude of their eigenvalues.
PCA – Step 4
Project the input data onto the principal components.
The new data values (scores) are generated for each observation as a linear combination of the original variables:
\[ Sc = X \cdot V \]
\[ Sc_{i,pc} = b_{pc,1} X_{i,1} + b_{pc,2} X_{i,2} + \ldots + b_{pc,k} X_{i,k} \]
Sc - score, i - observation, pc - principal component, b - loading (-1 to 1), k - variable
PCA - Fundamentals
[Figure: data points in the (X1, X2, X3) space with the 1st PC and 2nd PC axes and the projections of the points onto them]
The first PC is the eigenvector with the greatest eigenvalue of the covariance matrix of the data set. The eigenvalues are also the variances of the observations along each of the new coordinate axes.
Var(PC1) ≥ Var(PC2)
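The statement that the eigenvalues are the variances along the new axes follows in one line. As a short derivation, assume centred observations $\hat{X}_n = X_n - \bar{X}$, the 1/N covariance $C$ from Step 1, and a unit-length eigenvector $v$ with $Cv = \lambda v$; the symbol $s_n$ for the score of observation $n$ is introduced here only for illustration.

\[ s_n = v^T \hat{X}_n, \qquad Var(s) = \frac{1}{N} \sum_{n=1}^{N} (v^T \hat{X}_n)^2 = v^T \left[ \frac{1}{N} \sum_{n=1}^{N} \hat{X}_n \hat{X}_n^T \right] v = v^T C v = \lambda \, v^T v = \lambda \]

Sorting the eigenvectors by eigenvalue therefore orders the components by the variance they capture.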
PCA: Scores
[Figure: observation i in the (x1, x2, x3) space projected onto the 1st PC and 2nd PC, giving the scores $Sc_{i,1}$ and $Sc_{i,2}$]
The scores $Sc_{i,pc}$ are the places along the component lines where the observations are projected.
\[ Sc = X \cdot V \]
PCA: Loadings
The loadings $b_{pc,k}$ (component pc, variable k) indicate the importance of variable k to the given component. $b_{pc,k}$ is the direction cosine ($\cos \alpha_k$) of the given component line with respect to the $x_k$ coordinate axis.
[Figure: the 1st PC line in the (x1, x2, x3) space and the angles α1, α2, α3 between it and the three coordinate axes]
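The direction-cosine reading of the loadings can be illustrated numerically. The sketch below assumes NumPy and reuses the 3 x 3 covariance matrix from the earlier covariance example; it computes the cosine of the angle between the first PC and each coordinate axis and compares it with the components of the unit-length eigenvector.

```python
import numpy as np

C = np.array([[7.5, 4.5, 0.0],
              [4.5, 6.0, -6.0],
              [0.0, -6.0, 24.0]])
eigenvalues, eigenvectors = np.linalg.eigh(C)
first_pc = eigenvectors[:, np.argmax(eigenvalues)]   # unit-length loading vector of the 1st PC

axes = np.eye(3)                                     # unit vectors along x1, x2, x3
cosines = axes @ first_pc / (np.linalg.norm(axes, axis=1) * np.linalg.norm(first_pc))
angles_deg = np.degrees(np.arccos(cosines))

print("loadings        :", first_pc)
print("direction cosines:", cosines)
print("angles (degrees) :", angles_deg)
```

Because the eigenvector has unit length, the squared loadings of one component sum to 1, which is why each loading lies between -1 and 1. The overall sign of an eigenvector is arbitrary, so all cosines may flip sign together.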
PCA - Summary
Multivariate projection technique.
Reduces the dimensionality of data by transforming correlated variables into a smaller number of uncorrelated components.
Graphical overview.
Plot data in K-dimensional space.
Directions of maximum variation.
Best preserves the variance as measured in the high-dimensional input space.
Projection of data onto lower-dimensional planes.
Biological Background
[Figure: reverse transcriptase]
Areas Being Studied With Microarrays
To compare the expression of a protein (gene) between two or more tissues.
To check whether a protein appears in a specific tissue.
To find the difference in gene expression between a normal and a cancerous tissue.
cDNA Microarray Experiments
Different tissues, same organism (brain vs. liver).
Same tissue, different organisms.
Same tissue, same organism (tumour vs. non-tumour).
Time course experiments.
Microarray Technology
Method for measuring levels of expression of thousands of genes simultaneously.
There are two types of arrays:
cDNA and long oligonucleotide arrays.
Short oligonucleotide arrays.
• Each probe is ~25 nucleotides long.
• 16-20 probes for each gene.
The Idea
Target: cDNA (variables to be detected)
Probe: oligos/cDNA (gene templates)
Target + Probe: Hybridization
Brief Outline of Steps for Producing a Microarray
Produce mRNA
Hybridise
Complementary sequences will bind
Fluorescence shows binding
Scan array (extraction of intensities with image analysis software)
Hybridization
RNA is copied into cDNA with reverse transcriptase.
The cDNA is labeled.
Fluorescent labeling is most common, but radioactive labeling is also used.
Labeling may be incorporated in hybridization, or applied afterwards.
Then the labeled samples are hybridized to the microarrays.
Gene Expression Database – a Conceptual View
[Figure: a gene expression matrix of gene expression levels, with genes along one axis and samples along the other, linked to gene annotations and sample annotations]
The Article
The Biological Problem
The very high dimensional space of gene expression measurements obtained by DNA microarrays impedes the detection of underlying patterns in gene expression data and the identification of discriminatory genes.
Why Use PCA?
To obtain a direct link between patterns in genes and patterns in samples.
[Figure: sample annotations linked to gene annotations]
The Paper Shows:
Distinct patterns are obtained when the genes are projected onto a two-dimensional plane.
After the removal of irrelevant genes, the scores in the new space showed distinct tissue patterns.
The Data Used in the Experiment
Oligonucleotide microarray measurements of 7070 genes made in 40 normal human tissue samples.
The tissues they used were from brain, kidney, liver, lung, esophagus, skeletal muscle, breast, stomach, colon, blood, spleen, prostate, testes, vulva, proliferative endometrium, myometrium, placenta, cervix, and ovary.
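Results
PCA Loadings Can Be Used to Filter Irrelevant Genes
The data from 40 human tissues were first projected using PCA.
The first and second PCs account for ∼70% of the information present in the entire data set:
\[ \frac{\sum_{i=1}^{r} \sigma_i^2}{\sum_{i=1}^{R} \sigma_i^2} \]
where $\sigma_i$ is the i-th singular value, r is the number of retained PCs and R is the total number of PCs.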
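The fraction of information captured by the first r components can be computed directly from the singular values of the centred data. The sketch below is an illustration on random stand-in data, not the 40-tissue data set itself; the matrix shape, the column scaling and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 40 samples (rows) x 200 genes (columns),
# with columns scaled so that a few directions carry most of the variance.
X = rng.normal(size=(40, 200)) @ np.diag(np.linspace(3.0, 0.1, 200))
X_centered = X - X.mean(axis=0)

singular_values = np.linalg.svd(X_centered, compute_uv=False)
# Cumulative fraction: sum of the first r squared singular values over the sum of all.
fraction = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)

r = 2
print(f"first {r} PCs capture {fraction[r - 1]:.1%} of the total variance")
```

The same ratio is obtained from the eigenvalues of the covariance matrix, since each eigenvalue equals the corresponding squared singular value divided by the number of samples.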
Gene Selection Based on the Loadings on the Principal Components
Graph A shows the score plot of the samples before any filtering is implemented.
[Figure A: Score Plot of the Tissue Samples; axes: Scores on Principal Component 1 vs. Scores on Principal Component 2]
Graph B shows the loading plot of the genes before any filtering is implemented.
[Figure B: Loading Plot of the Genes; axes: Loadings on Principal Component 1 vs. Loadings on Principal Component 2]
The Filter on Loadings
Graph E displays quantitatively the decisions that went into the choice of the filtering threshold. It displays the distortion in the observed patterns, as measured through the squared difference, and the number of genes retained for analysis as the threshold is varied.
[Figure E: Squared Difference and Number of genes plotted against Threshold]
\[ SqDif = \sum_{s=1}^{40} \sum_{pc=1}^{5} (y_{s,pc,f} - y_{s,pc,o})^2 \]
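The trade-off shown in Graph E can be sketched as follows. This is only an illustration under several assumptions that the slides do not confirm: `y_f` and `y_o` are read as the scores of sample s on component pc after filtering and in the original data; a gene is kept when the magnitude of its loading on PC1 or PC2 reaches the threshold; and the filtered scores are obtained by projecting onto the original loading vectors restricted to the retained genes, so that the axes stay comparable. The data are random stand-ins and both function names are hypothetical.

```python
import numpy as np

def scores_and_loadings(X, n_pc):
    """Centred data, scores (samples x n_pc) and loadings (genes x n_pc) via SVD."""
    X_centered = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    loadings = Vt[:n_pc].T
    return X_centered, X_centered @ loadings, loadings

def squared_difference(X, threshold, n_pc=5):
    """SqDif between the original scores and the scores built from retained genes only."""
    X_centered, scores_original, loadings = scores_and_loadings(X, n_pc)
    # Assumed filter rule: keep a gene if |loading| on PC1 or PC2 is at least `threshold`.
    keep = np.any(np.abs(loadings[:, :2]) >= threshold, axis=1)
    scores_filtered = X_centered[:, keep] @ loadings[keep]
    sq_dif = np.sum((scores_filtered - scores_original) ** 2)
    return sq_dif, int(keep.sum())

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 300))   # hypothetical: 40 samples x 300 genes
for threshold in (0.0, 0.02, 0.05, 0.1):
    sq_dif, n_genes = squared_difference(X_demo, threshold)
    print(f"threshold={threshold:.2f}  genes kept={n_genes:3d}  SqDif={sq_dif:.3f}")
```

Raising the threshold removes more genes, so the number of retained genes can only fall, while the squared difference measures how far the sample pattern has moved from the unfiltered one.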
The Filter on the Loadings - Continued
The chosen filter threshold was 0.001.
Filtering reduced the number of genes from 7070 to 425.
[Figure: Squared Difference and Number of genes plotted against Threshold]
Graph C shows the score plot after the filtering.
[Figure C: Score Plot of the Tissue Samples; axes: Scores on Principal Component 1 vs. Scores on Principal Component 2]
Graph D shows the loading plot after the filtering.
[Figure D: Loading Plot of the Genes; axes: Loadings on Principal Component 1 vs. Loadings on Principal Component 2]
Compare ..
[Figures: two Score Plots of the Tissue Samples shown together for comparison; axes: Scores on Principal Component 1 vs. Scores on Principal Component 2]
The dramatic reduction from the initial 7070 genes to the 425 finally retained resulted in minimal loss of the information relevant to describing the samples in the reduced space.
Compare ..
[Figures: two Loading Plots of the Genes shown together for comparison; axes: Loadings on Principal Component 1 vs. Loadings on Principal Component 2]
Three linear structures can be identified in the loading plot of the 425 genes selected by the above analysis.
Each structure comprises a set of genes.
PCA – Discussion
PCA has a strong, yet flexible, mathematical structure.
PCA simplifies the “views” of the data.
Reduces the dimensionality of the gene expression space.
The correspondence between the score plot and the loading plot enables the elimination of redundant variables.
PCA allowed the classification of new samples belonging to the tissue types used.
PCA – Discussion (Cont.)
In the article this method facilitated the identification of strong underlying structures in the data. The identification of such structures is uniquely dependent on the data and is not generally guaranteed.
There is no “correct” way of classification; “biological understanding” is the ultimate guide.
My Critique
Positives
Can deal with large data sets.
No assumptions were made about the data.
This method is general and may be applied to any data set.
Negatives
Nonlinear structure is invisible to PCA.
The meaning of features is lost when linear combinations are formed.
True covariance matrices are usually not known; they are estimated from the data.
The graph: the first component is chosen along the line of largest variance => both clusters strongly overlap.
Projection onto the axis orthogonal to the first PCA component gives much more discriminating power.
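This limitation can be reproduced on synthetic data. The sketch below is an assumed setup, not the graph from the slide: two hypothetical clusters share a wide spread along x and differ only by a small shift along y, and the helper `separation` is introduced purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two hypothetical clusters: wide spread along x, separated only along y.
cluster_a = np.column_stack([rng.normal(0.0, 5.0, n), rng.normal(+1.0, 0.2, n)])
cluster_b = np.column_stack([rng.normal(0.0, 5.0, n), rng.normal(-1.0, 0.2, n)])
X = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * n + [1] * n)

# PCA on the pooled data; the class labels are not used.
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
scores = X_centered @ eigenvectors[:, order]

def separation(values, labels):
    """Distance between the cluster means in units of the pooled standard deviation."""
    a, b = values[labels == 0], values[labels == 1]
    pooled_std = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled_std

sep_pc1 = separation(scores[:, 0], labels)
sep_pc2 = separation(scores[:, 1], labels)
print(f"separation along PC1: {sep_pc1:.2f}")
print(f"separation along PC2: {sep_pc2:.2f}")
```

PCA never sees the labels, so it picks the direction of largest overall variance even when the class difference lies along the orthogonal axis.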
Thank you!!!