![Page 1: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/1.jpg)
Gene Set Enrichment Analysis
Dr. Vered Caspi
Head, Bioinformatics Core Facility
Ben-Gurion University of the Negev
Advanced Bioinformatics Course, Weizmann Institute of Science
April 14th, 2010
![Page 2: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/2.jpg)
Gene Expression Matrix
![Page 3: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/3.jpg)
Figure: experimental design. Samples 01-25 grouped by Type (Normal, Down syndrome) and Tissue (Astrocyte, Cerebellum, Cerebrum, Heart).
Slide by Vered Caspi
Usually, pairwise comparisons between groups of samples are performed, e.g. sick vs. healthy, or sick cerebellum vs. healthy cerebellum.
![Page 4: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/4.jpg)
This is done with a statistical test, e.g. ANOVA, which yields a fold change and a p-value for each pairwise comparison (also called a “contrast”).
![Page 5: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/5.jpg)
![Page 6: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/6.jpg)
Generation of gene lists
Cutoff: p-value < 0.01 AND |fold change| > 1.3

| List | No. of genes |
| --- | --- |
| Down - Normal | 196 |
| Cerebrum: Down - Normal | 193 |
| Cerebellum: Down - Normal | 564 |
| Astrocyte: Down - Normal | 296 |
| Heart: Down - Normal | 269 |
| Genes in any of the lists | 1228 |
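The cutoff above can be sketched in a few lines of Python. This is only an illustration: the gene names, fold changes and p-values are made up, and the fold change is assumed to be a signed ratio (e.g. -1.6 for a 1.6-fold decrease), which the slide does not specify.

```python
def select_genes(results, p_cutoff=0.01, fc_cutoff=1.3):
    """Return genes with p-value < p_cutoff AND |fold change| > fc_cutoff."""
    return [gene for gene, (fold_change, p_value) in results.items()
            if p_value < p_cutoff and abs(fold_change) > fc_cutoff]

# Hypothetical results per contrast: gene -> (signed fold change, p-value)
contrasts = {
    "Cerebrum: Down - Normal": {
        "GENE_A": (2.1, 0.001),
        "GENE_B": (-1.6, 0.004),
        "GENE_C": (1.2, 0.0005),   # significant, but the change is too small
    },
    "Heart: Down - Normal": {
        "GENE_A": (1.8, 0.003),
        "GENE_D": (-3.0, 0.2),     # large change, but not significant
        "GENE_E": (-1.4, 0.009),
    },
}

# One gene list per contrast
gene_lists = {name: select_genes(res) for name, res in contrasts.items()}

# Genes that appear in any of the lists (the union of all lists)
in_any_list = set().union(*gene_lists.values())

for name, genes in gene_lists.items():
    print(name, len(genes), genes)
print("No. of genes in any of the lists:", len(in_any_list))
```

Both conditions must hold, so a gene with a tiny p-value but a small change is excluded, as is a gene with a large but non-significant change.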
![Page 7: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/7.jpg)
Venn diagram
![Page 8: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/8.jpg)
• If the experiment contains more than two
treatment groups, an alternative way to get
lists of “interesting genes” is clustering
analysis.
![Page 9: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/9.jpg)
Clustering
Partitioning (K-means, Self-Organizing Maps)
Hierarchical
Ben Bolstad, Biostatistics, University of California, Berkeley, www.stat.berkeley.edu/~bolstad
![Page 10: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/10.jpg)
![Page 11: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/11.jpg)
![Page 12: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/12.jpg)
Figure: expression profiles of a gene cluster. Y axis: normalized expression (log scale, 100 to 1000). X axis: samples WT.1 WT.2 WT.3 KO.1 KO.2 KO.3 OE.1 OE.2 OE.3
Genes which show a similar expression pattern across the treatments are called “coexpressed genes”
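One way to make "similar expression pattern" concrete is the Pearson correlation between two expression profiles. The sketch below is an assumption about how similarity could be measured, not the method used for the slide; the gene names and values are invented, with nine samples ordered WT, KO, OE as in the figure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log-scale profiles over WT.1-3, KO.1-3, OE.1-3
gene1 = [8.0, 8.1, 7.9, 6.0, 6.1, 5.9, 9.0, 9.1, 8.9]
gene2 = [7.0, 7.1, 6.9, 5.0, 5.1, 4.9, 8.0, 8.1, 7.9]  # same shape, lower level
gene3 = [6.0, 6.1, 5.9, 8.0, 8.1, 7.9, 5.0, 5.1, 4.9]  # opposite pattern

print(pearson(gene1, gene2))  # close to 1: coexpressed
print(pearson(gene1, gene3))  # negative: opposite behavior
```

Correlation ignores the absolute expression level, so gene1 and gene2 count as coexpressed even though one is expressed at a lower level throughout.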
![Page 13: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/13.jpg)
The gene expression pattern may be viewed as a line graph (profile), as in the previous slide, or as a heat map, as shown below.
Left – two treatments.
Right – more than two treatments
![Page 14: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/14.jpg)
• Given a group of coexpressed genes, we may now ask: what is the common
biological theme shared among the genes in the group?
• Examples of biological themes:
– Metabolic pathway
– Signal transduction pathway
– Subcellular localization
– Protein complex
– Modulated by the same transcription factor(s)
– Targets of the same miRNA
– Chromosomal location
– And more…
![Page 16: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/16.jpg)
• Do you expect genes in the same biological
pathway, or same complex, to all change in
the same direction (up or down regulation)
upon a certain treatment?
![Page 17: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/17.jpg)
Behavior of Arabidopsis genes of two GO terms upon dedifferentiation
Figure panels: "Response to light intensity" and "Anatomical structure formation involved in morphogenesis"
![Page 18: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/18.jpg)
Reminder: a group of coexpressed genes
Figure: expression profiles of a gene cluster. Y axis: normalized expression (log scale, 100 to 1000). X axis: samples WT.1 WT.2 WT.3 KO.1 KO.2 KO.3 OE.1 OE.2 OE.3
![Page 19: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/19.jpg)
The problem: Given a group of genes and a collection of gene sets: find enriched gene sets
![Page 20: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/20.jpg)
Two questions
• How shall the enrichment be calculated?
• Which data set collections shall be used?
![Page 22: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/22.jpg)
Data set collection: MSigDB
![Page 23: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/23.jpg)
![Page 24: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/24.jpg)
Example for a curated gene set:
![Page 25: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/25.jpg)
![Page 26: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/26.jpg)
![Page 27: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/27.jpg)
You may also create your own gene sets…
![Page 28: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/28.jpg)
Enrichment analysis using hypergeometric test: “compute overlaps”
![Page 29: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/29.jpg)
MSigDB Analysis: “compute overlaps”
Example: 251 genes
![Page 30: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/30.jpg)
![Page 31: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/31.jpg)
Another example:
![Page 32: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/32.jpg)
Another example
![Page 33: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/33.jpg)
Overlap matrix by gene and gene set
![Page 34: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/34.jpg)
Enrichment analysis
• Hypergeometric test is adequate when we have a
group of genes (out of all genes) for which we
wish to test enrichment of a certain gene set.
• These may be co-expressed genes from microarray
cluster analysis
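The hypergeometric test asks: if n genes were drawn at random from all N genes, how likely is it to see at least k genes from a gene set of size K? A minimal sketch follows; the function name and the example numbers other than the 251-gene group are hypothetical.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric.

    N: total number of genes (the background)
    K: number of background genes that belong to the gene set
    n: size of the gene group being tested
    k: observed overlap between the group and the gene set
    """
    upper = min(K, n)  # the overlap cannot exceed either size
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# Hypothetical example: a group of 251 genes out of 20000,
# 10 of which fall in a gene set of 100 genes.
p = hypergeom_pvalue(N=20000, K=100, n=251, k=10)
print(p)
```

The expected overlap by chance here is about 251 × 100 / 20000 ≈ 1.3 genes, so an overlap of 10 gives a very small p-value. When many gene sets are tested, the p-values still need a multiple-testing correction.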
![Page 35: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/35.jpg)
Hypergeometric test:
genes
![Page 36: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/36.jpg)
The problem: Given a group of genes and a collection of gene sets: find enriched gene sets
![Page 37: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/37.jpg)
DAVID Bioinformatics
![Page 38: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/38.jpg)
Enrichment analysis considering
continuous data
• But now consider the following case:
![Page 39: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/39.jpg)
![Page 40: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/40.jpg)
Enrichment analysis
• In this case we only compared two treatments, and
defining a “differentially expressed genes” group
requires setting an arbitrary cutoff.
• GSEA allows us to overcome this limitation.
![Page 41: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/41.jpg)
![Page 42: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/42.jpg)
Rank the gene list according to:
Genes’ differential expression with respect to two
phenotypes
Or
Genes’ correlation with a predefined expression
profile
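As an illustration of the first option, the sketch below ranks genes by a signal-to-noise score, (mean of class A minus mean of class B) divided by (standard deviation of A plus standard deviation of B). This is one ranking metric of this kind; the slide does not say which metric was used, and the gene names, labels and values are invented.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """Difference of class means scaled by the sum of class standard deviations."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def rank_genes(expression, labels, class_a, class_b):
    """Return (gene, score) pairs sorted from most up in class_a to most up in class_b."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: three "sick" and three "healthy" samples
labels = ["sick", "sick", "sick", "healthy", "healthy", "healthy"]
expression = {
    "GENE_A": [9.0, 9.2, 8.8, 6.0, 6.1, 5.9],  # higher in sick
    "GENE_B": [5.0, 5.2, 4.8, 5.1, 4.9, 5.0],  # unchanged
    "GENE_C": [4.0, 4.1, 3.9, 7.0, 7.2, 6.8],  # higher in healthy
}

ranked = rank_genes(expression, labels, "sick", "healthy")
for gene, score in ranked:
    print(gene, round(score, 2))
```

Every gene keeps a place in the ranked list, including the unchanged ones in the middle, so no cutoff is needed at this stage.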
![Page 43: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/43.jpg)
For each gene set (S),
Mark the location of the genes from set S within
the sorted gene list
![Page 44: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/44.jpg)
Plot the running sum for S in the dataset, including the location of
the maximum enrichment score (ES) and the leading-edge subset
![Page 45: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/45.jpg)
Calculation of an Enrichment Score.
We calculate an enrichment score (ES) that reflects the
degree to which a set S is overrepresented at the extremes
(top or bottom) of the entire ranked list L.
The score is calculated by walking down the list L,
increasing a running-sum statistic when we encounter a
gene in S and decreasing it when we encounter genes not in
S.
The magnitude of the increment depends on the correlation
of the gene with the phenotype.
The enrichment score is the maximum deviation from zero
encountered in the random walk; it corresponds to a
weighted Kolmogorov–Smirnov-like statistic.
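The walk described above can be sketched as follows. Assumptions are labeled in the code: each hit adds the gene's |score|^p divided by the total over all hits, each miss subtracts 1 / (number of genes not in S), and p = 1 is taken as the default weight. Gene names and scores are invented.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Running-sum enrichment score.

    ranked:   list of (gene, score) pairs, sorted by score in descending order
    gene_set: set of gene names (the set S)
    p:        weight exponent (assumed default 1.0; p = 0 weights all hits equally)
    Returns (ES, running-sum profile).
    """
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    hit_total = sum(abs(score) ** p for gene, score in ranked if gene in gene_set)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("need genes outside S and a non-zero total hit weight")

    running, es, profile = 0.0, 0.0, []
    for gene, score in ranked:
        if gene in gene_set:
            running += abs(score) ** p / hit_total  # step up on a hit
        else:
            running -= 1.0 / n_miss                 # step down on a miss
        profile.append(running)
        if abs(running) > abs(es):                  # maximum deviation from zero
            es = running
    return es, profile

# Hypothetical ranked list L
ranked = [("G1", 3.0), ("G2", 2.0), ("G3", 1.0),
          ("G4", -1.0), ("G5", -2.0), ("G6", -3.0)]

es_top, profile_top = enrichment_score(ranked, {"G1", "G2"})
es_bottom, _ = enrichment_score(ranked, {"G5", "G6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list gives a positive ES and a set concentrated at the bottom gives a negative ES; the running sum always returns to zero at the end of the list.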
![Page 46: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/46.jpg)
Estimation of Significance Level of ES.
We permute the phenotype labels and recompute the ES of the gene set for the
permuted data, which generates a null distribution for the ES. The empirical,
nominal P value of the observed ES is then calculated relative to this null
distribution. Importantly, the permutation of class labels preserves gene-gene
correlations and, thus, provides a more biologically reasonable assessment of
significance than would be obtained by permuting genes among the gene sets.
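A minimal sketch of the phenotype-permutation idea is given below. To keep it short, genes are ranked by a plain difference of class means, which is a simplification and not necessarily the ranking metric used in practice. The nominal p-value is taken from the part of the null distribution with the same sign as the observed ES. All names and data are hypothetical.

```python
import random
from statistics import mean

def es_for_labels(expression, labels, gene_set):
    """Rank genes by mean(class 1) - mean(class 0), then compute the ES of gene_set."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == 1]
        b = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = mean(a) - mean(b)
    ranked = sorted(scores, key=scores.get, reverse=True)
    hit_total = sum(abs(scores[g]) for g in ranked if g in gene_set)
    n_miss = sum(1 for g in ranked if g not in gene_set)
    if hit_total == 0 or n_miss == 0:
        return 0.0
    running, es = 0.0, 0.0
    for g in ranked:
        running += abs(scores[g]) / hit_total if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(es):
            es = running
    return es

def permutation_pvalue(expression, labels, gene_set, n_perm=1000, seed=0):
    """Shuffle the phenotype labels n_perm times to build a null distribution of ES."""
    rng = random.Random(seed)
    observed = es_for_labels(expression, labels, gene_set)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # samples keep their data; only the labels move
        null.append(es_for_labels(expression, shuffled, gene_set))
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    extreme = [es for es in same_sign if abs(es) >= abs(observed)]
    p_value = len(extreme) / len(same_sign) if same_sign else float("nan")
    return observed, p_value, null

# Hypothetical data: four class-1 and four class-0 samples
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expression = {
    "G1": [9.0, 9.1, 8.9, 9.2, 6.0, 6.1, 5.9, 6.2],
    "G2": [8.0, 8.2, 7.9, 8.1, 6.0, 6.1, 5.8, 6.1],
    "G3": [5.0, 5.1, 4.9, 5.0, 5.0, 4.9, 5.1, 5.0],
    "G4": [5.2, 5.0, 5.1, 4.9, 5.0, 5.1, 4.9, 5.2],
    "G5": [4.0, 4.1, 3.9, 4.0, 6.0, 6.1, 5.9, 6.0],
    "G6": [3.0, 3.1, 2.9, 3.0, 6.0, 6.2, 5.8, 6.0],
}

observed, p_value, null = permutation_pvalue(expression, labels, {"G1", "G2"}, n_perm=200)
print(observed, p_value)
```

Because whole samples are relabeled, each gene's values stay together with those of every other gene in the same sample, which is how the gene-gene correlations are preserved.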
Adjustment for Multiple Hypothesis Testing.
We first normalize the ES for each gene set to account for the size of the set,
yielding a normalized enrichment score (NES).
We then control the proportion of false positives by calculating the false discovery
rate (FDR) corresponding to each NES. The FDR is the estimated probability that a
set with a given NES represents a false positive finding; it is computed by
comparing the tails of the observed and null distributions for the NES.
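The normalization step can be sketched as dividing the observed ES by the mean of the null ES values that share its sign, so that sets of different sizes become comparable. The function name and the numbers are hypothetical, and the FDR step, which compares the tails of the observed and null NES distributions across all gene sets, is not implemented here.

```python
from statistics import mean

def normalized_es(observed_es, null_es):
    """Divide the ES by the mean magnitude of the same-sign part of its null distribution."""
    if observed_es >= 0:
        same_sign = [es for es in null_es if es >= 0]
    else:
        same_sign = [es for es in null_es if es < 0]
    if not same_sign:
        raise ValueError("no null ES values with the same sign as the observed ES")
    scale = abs(mean(same_sign))
    if scale == 0:
        raise ValueError("null ES values of this sign average to zero")
    return observed_es / scale

# Hypothetical null distribution from permutations of one gene set
null_es = [0.2, 0.4, -0.3, 0.3, -0.5]

print(normalized_es(0.6, null_es))   # positive ES scaled by the positive nulls
print(normalized_es(-0.8, null_es))  # negative ES scaled by the negative nulls
```

Positive and negative scores are rescaled separately because the null distribution of the ES is typically not symmetric around zero.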
![Page 47: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/47.jpg)
![Page 48: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/48.jpg)
![Page 49: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/49.jpg)
![Page 50: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/50.jpg)
![Page 51: Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter](https://reader031.vdocument.in/reader031/viewer/2022011819/5e9be9ffdb61690c3024ccbb/html5/thumbnails/51.jpg)