__________________________________________________________________________________________________
10/16/2015 GCBA 815
Tools and Algorithms in Bioinformatics
GCBA815, Fall 2015
Week-8: WebGestalt, DAVID, Gene Set Enrichment Analysis (GSEA)
Simarjeet K. Negi, Ph.D. candidate
(Guda Lab)
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
__________________________________________________________________________________________________
Why perform enrichment analysis?
• Large gene lists resulting from high-throughput analysis
• Deciphering the biology
• Organize expression changes into meaningful functional themes
• Gene enrichment analysis increases the likelihood of identifying the
molecular processes/functions most pertinent to the study
__________________________________________________________________________________________________
Principle of Enrichment Analysis
• If a biological process is abnormal in a given study, the co-functioning
genes should have a higher (enriched) potential to be selected as a
relevant group by high-throughput screening technologies
• The analytic conclusion is based on a group of relevant genes rather than on
individual genes, which increases the likelihood of identifying the biological
processes most pertinent to the study
• Enrichment tools map a large number of ‘interesting’ genes to
biological annotation terms (e.g. GO terms or pathways)
• The enrichment of user genes for each annotation term is then examined
statistically by comparing the outcome to a control (or reference)
background
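The statistical comparison against a reference background can be illustrated with a hypergeometric (one-sided) test for a single annotation term. This is a minimal sketch, not the exact calculation of any tool covered here; the function name and the term counts (200 annotated background genes, 20 of them in the list) are hypothetical, while 487 and 11521 are the list and reference sizes of the tutorial dataset.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    k: genes in the user list annotated with the term
    n: size of the user list
    K: genes in the background annotated with the term
    N: size of the background (reference) set
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probability of every outcome at least as extreme as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical term: 200 of 11521 background genes carry it,
# and 20 of the 487 list genes carry it.
p_value = hypergeom_enrichment_p(k=20, n=487, K=200, N=11521)
print(f"enrichment p-value: {p_value:.3g}")
```

By chance alone about 487 × 200 / 11521 ≈ 8.5 annotated genes would be expected in the list, so observing 20 yields a small p-value.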
__________________________________________________________________________________________________
Classification of Enrichment Tools
• Based on differences in their algorithms, current enrichment tools can be
broadly divided into three classes:
  • Singular enrichment analysis (SEA), e.g. WebGestalt (overrepresentation approach)
  • Gene set enrichment analysis (GSEA), e.g. GSEA (aggregate score approach)
  • Modular enrichment analysis (MEA), e.g. DAVID (overrepresentation approach)
• Note: some tools with diverse capabilities belong to more than one class
__________________________________________________________________________________________________
WebGestalt: WEB-based Gene SeT AnaLysis Toolkit
(http://bioinfo.vanderbilt.edu/webgestalt/)
__________________________________________________________________________________________________
WebGestalt: WEB-based Gene SeT AnaLysis Toolkit
• Input: the user’s preselected ‘interesting’ genes (e.g. genes differentially
expressed between experimental and control samples)
• Tests the enrichment of each annotation term one by one, in a linear mode
• Integrates functional enrichment analysis with information
visualization
• Constantly updated
• Efficiently processes large gene lists
• Weakness: the output list of terms can be large, which dilutes the focus and
obscures the interrelationships among relevant terms
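Testing every annotation term one by one can be sketched as a simple loop. The code below is illustrative only and is not WebGestalt's implementation: the hypergeometric test and the Benjamini-Hochberg adjustment are assumptions (common choices for this kind of analysis), and all function names, term names and gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k)."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def singular_enrichment(gene_list, background, term_to_genes):
    """Test each term independently; returns (term, overlap, p, adjusted p)."""
    bg = set(background)
    genes = set(gene_list) & bg
    raw = []
    for term, members in term_to_genes.items():
        members_bg = set(members) & bg
        k = len(genes & members_bg)
        raw.append((term, k, hypergeom_p(k, len(genes), len(members_bg), len(bg))))
    adjusted = benjamini_hochberg([p for _, _, p in raw])
    return [(t, k, p, q) for (t, k, p), q in zip(raw, adjusted)]


# Hypothetical toy data: 20 background genes, 5 'interesting' genes, 2 terms.
background = [f"g{i}" for i in range(20)]
interesting = background[:5]
terms = {"TERM_A": background[:4], "TERM_B": background[10:14]}
for row in singular_enrichment(interesting, background, terms):
    print(row)
```

Because every term is tested independently, a long gene list can return many significant and partly redundant terms, which is the weakness noted above.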
__________________________________________________________________________________________________
DAVID: Database for Annotation, Visualization and Integrated
Discovery (https://david.ncifcrf.gov/home.jsp)
__________________________________________________________________________________________________
DAVID: Database for Annotation, Visualization and Integrated
Discovery
• DAVID inherits the basic enrichment calculation found in
WebGestalt
• Input: user-defined gene list
• Incorporates extra network discovery algorithms that consider
term-to-term relationships
• Improves discovery sensitivity and specificity by considering the
interrelationships of GO terms in the enrichment calculations
• Joint terms may carry a unique biological meaning for a given study that is not
held by the individual terms
• Weakness: not updated in recent years; the user input gene list is limited to
3000 genes
__________________________________________________________________________________________________
GSEA: Gene Set Enrichment Analysis
(http://www.broadinstitute.org/gsea/)
• Identifies the enriched pathways/gene sets between two biological states
• The program uses an underlying database (MSigDB) of about 11,000 gene sets
that include KEGG, BIOCARTA pathways, curated sets from disease states, etc.
__________________________________________________________________________________________________
Using MSigDB
[Figure: Seven Broader Collections of GSEA]
• Search
• Browse
• Examine gene sets
• Investigate
• Download
__________________________________________________________________________________________________
GSEA: Gene Set Enrichment Analysis
• GSEA program (download to your PC)
• Input: Expression dataset (between two conditions); Phenotype labels between two states; Gene
sets in gmx/gmt format (MSigDB - supplied by GSEA)
• GSEA implements a ‘no-cutoff’ strategy, taking all genes from a microarray
experiment without first selecting significant genes (e.g. genes with P-value ≤ 0.05
and fold change ≥ 2)
• The GSEA method requires one summarized biological value per gene (e.g. fold change)
• Weakness:
• It is sometimes difficult to summarize many biological aspects of a gene into one
meaningful value; examples: SNP arrays, clinical microarray studies
• GSEA has less power to detect a gene set that mixes genes with positive and negative
associations with the phenotype
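The ‘no-cutoff’ input, one summarized value for every gene, can be sketched as a ranked list built from log2 fold change. This is a minimal illustration under stated assumptions: the function name, the pseudocount and the toy expression values are hypothetical, and the GSEA desktop program offers its own ranking metrics (its default is a signal-to-noise ratio rather than fold change).

```python
from math import log2
from statistics import mean


def rank_by_log2_fold_change(expr, group_a, group_b, pseudocount=1.0):
    """Rank all genes by log2(mean of group A / mean of group B).

    expr: dict mapping gene -> dict mapping sample name -> expression value
    group_a, group_b: lists of sample names for the two phenotypes
    pseudocount: small constant (an assumption) that avoids division by zero
    """
    scores = {}
    for gene, values in expr.items():
        mean_a = mean(values[s] for s in group_a)
        mean_b = mean(values[s] for s in group_b)
        scores[gene] = log2((mean_a + pseudocount) / (mean_b + pseudocount))
    # No gene is filtered out: every gene keeps its place in the ranking.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy dataset with two samples per phenotype.
expr = {
    "GENE_UP": {"a1": 7, "a2": 7, "b1": 1, "b2": 1},
    "GENE_DOWN": {"a1": 1, "a2": 1, "b1": 7, "b2": 7},
    "GENE_FLAT": {"a1": 3, "a2": 3, "b1": 3, "b2": 3},
}
for gene, score in rank_by_log2_fold_change(expr, ["a1", "a2"], ["b1", "b2"]):
    print(gene, round(score, 3))
```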
__________________________________________________________________________________________________
Tutorial
__________________________________________________________________________________________________
WebGestalt: example dataset
• 487 colorectal cancer prognosis genes downloaded from Shi et al. 2012
• 11,521 genes from the protein-protein interaction network used in the same
paper serve as the reference gene set
• Genes are from a human study
__________________________________________________________________________________________________
WebGestalt: WEB-based Gene SeT AnaLysis Toolkit
(http://bioinfo.vanderbilt.edu/webgestalt/)
Organism: hsapiens
Gene ID type: hsapiens_gene_symbol
Uploaded gene list: Colorectal_cancer_genes
__________________________________________________________________________________________________
Reference set: PPI_network
Reference set ID type: hsapiens_gene_symbol
__________________________________________________________________________________________________
GO Analysis
Nodes with red labels represent enriched categories, and nodes with black
labels represent their non-enriched parents
__________________________________________________________________________________________________
KEGG Analysis
Genes highlighted in red in the pathway map are those present in the user input
__________________________________________________________________________________________________
DAVID: example dataset
• 408 genes involved in the cellular responses to HIV envelope protein
infection in resting or suboptimally activated peripheral blood mononuclear
cells (Cicala et al. 2002)
• Affymetrix U95A microarray chip (genome-wide expression) as the
reference gene set
__________________________________________________________________________________________________
DAVID: Database for Annotation, Visualization and Integrated
Discovery (https://david.ncifcrf.gov/home.jsp)
__________________________________________________________________________________________________
Gene list: HIV_genes
• When multiple species pop up, click on the species of interest
and press ‘Select Species’
• If multiple gene lists are open in the program, select the
gene list of interest and click on ‘Use’
__________________________________________________________________________________________________
Percentage, e.g. 33/398 (involved genes/total genes), i.e. about 8.3%
__________________________________________________________________________________________________
KEGG Pathway, BIOCARTA
Genes from the user’s list are marked with red stars
__________________________________________________________________________________________________
Table Report is a gene-centric view that lists the genes and their associated annotation
terms (selected terms only). No statistics are applied in this report.
__________________________________________________________________________________________________
• User input genes are classified into big gene functional groups
• Measure of the importance of a gene group in the user’s gene list
• Key biology of this gene group
• Check whether any other genes in the gene list or in the genome are
functionally similar to this gene group
• How the members share common annotations/biology
__________________________________________________________________________________________________
GSEA dataset
• Transcriptional profiles from p53+ and p53 mutant cancer cell lines
• Expression datasets: P53_hgu95av2.gct, P53_collapsed.gct
'Collapsed' refers to datasets whose identifiers (i.e. Affymetrix probe
set IDs) have been replaced with gene symbols
• Phenotype labels (e.g. tumor vs. normal): P53.cls
• Gene set: c1.v2.symbols.gmt
__________________________________________________________________________________________________
http://www.broadinstitute.org/gsea/datasets.jsp
GSEA: Gene Set Enrichment Analysis
(http://www.broadinstitute.org/gsea/)
__________________________________________________________________________________________________
GCT file format; expression data file
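As a rough illustration of the GCT layout, the sketch below reads a tab-delimited GCT file: a `#1.2` version line, a line with the number of rows and samples, a header line (name, description, sample names), then one row per gene or probe set. The function name and the toy content are hypothetical; real files such as P53_collapsed.gct should be checked against the GSEA file format documentation.

```python
import io


def read_gct(handle):
    """Parse a GCT (v1.2) expression file into (sample names, {row name: values})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")
    n_rows, n_samples = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = header[2:]  # first two columns are the name and description
    rows = {}
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        rows[fields[0]] = [float(v) for v in fields[2:]]
    # Cross-check the declared dimensions against what was actually read.
    if len(rows) != n_rows or len(samples) != n_samples:
        raise ValueError("declared dimensions do not match file content")
    return samples, rows


# Hypothetical two-gene, three-sample example.
demo = (
    "#1.2\n"
    "2\t3\n"
    "NAME\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tna\t1.0\t2.5\t3.0\n"
    "GENE_B\tna\t4\t5\t6\n"
)
samples, rows = read_gct(io.StringIO(demo))
print(samples, rows)
```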
__________________________________________________________________________________________________
CLS file format; phenotype file
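A categorical CLS file has three lines: the number of samples, the number of classes and the constant 1; a `#` line naming the classes; and one label per sample, in the same order as the samples in the expression file. The reader below is a minimal sketch with a hypothetical function name and toy content. It assumes that class names are assigned to labels in order of first appearance on the third line, which should be verified against the GSEA documentation.

```python
import io


def read_cls(handle):
    """Parse a categorical CLS file into (class names, per-sample class names)."""
    n_samples, n_classes, _ = (int(x) for x in handle.readline().split())
    names_line = handle.readline().strip()
    if not names_line.startswith("#"):
        raise ValueError("second line of a CLS file must start with '#'")
    class_names = names_line[1:].split()
    labels = handle.readline().split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("declared counts do not match file content")
    # Assumption: the i-th distinct label on line 3 corresponds to the i-th class name.
    seen = []
    for label in labels:
        if label not in seen:
            seen.append(label)
    mapping = dict(zip(seen, class_names))
    return class_names, [mapping[label] for label in labels]


# Hypothetical six-sample, two-class example.
demo = "6 2 1\n# MUT WT\n0 0 0 1 1 1\n"
class_names, sample_classes = read_cls(io.StringIO(demo))
print(class_names, sample_classes)
```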
__________________________________________________________________________________________________
GMT file format; gene sets
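In a GMT file each line is one gene set: the set name, a description, and then the member genes, all separated by tabs. The following is a minimal sketch of a reader; the function name, set names and genes are hypothetical.

```python
import io


def read_gmt(handle):
    """Parse a GMT file into {gene set name: list of member genes}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]  # drop empty trailing cells
    return gene_sets


# Hypothetical two-set example.
demo = (
    "SET_A\tfirst hypothetical set\tGENE1\tGENE2\tGENE3\n"
    "SET_B\tsecond hypothetical set\tGENE2\tGENE4\n"
)
gene_sets = read_gmt(io.StringIO(demo))
print(gene_sets)
```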
__________________________________________________________________________________________________
GSEA: Gene Set Enrichment Analysis
__________________________________________________________________________________________________
Chip platform: ftp.broad.mit.edu://pub/gsea/annotations/HG_U95Av2.chip
__________________________________________________________________________________________________
Interpreting GSEA Results
GSEA Statistics
GSEA computes four key statistics for the gene set enrichment analysis report:
● Enrichment Score (ES)
● Normalized Enrichment Score (NES)
● False Discovery Rate (FDR)
● Nominal P Value
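How the normalized enrichment score and the nominal P value relate to an observed ES can be sketched from a set of null ES values obtained by permutation. This is an illustration under stated assumptions, not the GSEA program's code: the function name and the null values are hypothetical, and only null scores with the same sign as the observed ES are used, as GSEA's documentation describes.

```python
from statistics import mean


def normalize_es(es, null_es):
    """Return (NES, nominal p-value) for an observed ES and permutation null ES values."""
    # Keep only the null scores on the same side of zero as the observed ES.
    same_sign = [x for x in null_es if (x >= 0) == (es >= 0)]
    if not same_sign:
        raise ValueError("no null scores share the sign of the observed ES")
    scale = mean(abs(x) for x in same_sign)
    if scale == 0:
        raise ValueError("null scores are all zero")
    nes = es / scale
    # Fraction of same-sign null scores at least as extreme as the observed ES.
    p_value = sum(1 for x in same_sign if abs(x) >= abs(es)) / len(same_sign)
    return nes, p_value


# Hypothetical observed ES and five permutation scores.
nes, p_value = normalize_es(0.6, [0.2, 0.4, -0.3, 0.6, -0.5])
print(nes, p_value)
```

Dividing by the mean null score accounts for gene set size, which makes NES values comparable across gene sets; the FDR is then estimated from the NES values of all gene sets.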
__________________________________________________________________________________________________
Enrichment plot: Enrichment Score (ES)
• The enrichment score (ES) reflects the degree to which a gene set is
overrepresented at the top or bottom of a ranked list of genes
• GSEA calculates the ES by walking down the ranked list of genes, increasing a
running-sum statistic when a gene is in the gene set and decreasing it when it is not
• The magnitude of the increment depends on the correlation of the gene with the
phenotype
• The ES is the maximum deviation from zero encountered in walking the list
• A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES
indicates gene set enrichment at the bottom of the ranked list
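The walk down the ranked list can be sketched as follows. This is a simplified illustration, not the GSEA program's code: it assumes a hit raises the running sum by |score|^p divided by the total over all hits, and a miss lowers it by 1 divided by the number of misses, with p = 1. The function name, genes and scores are hypothetical.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Return (ES, running-sum profile) for a ranked list and a gene set.

    ranked: list of (gene, score) pairs sorted from highest to lowest score
    gene_set: iterable of gene names
    p: exponent applied to |score| for hits (assumption: 1.0)
    """
    members = set(gene_set)
    hit_weights = [abs(score) ** p if gene in members else 0.0 for gene, score in ranked]
    n_hits = sum(1 for gene, _ in ranked if gene in members)
    n_miss = len(ranked) - n_hits
    total_hit = sum(hit_weights)
    if n_hits == 0 or n_miss == 0 or total_hit == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, profile = 0.0, []
    for (gene, _), weight in zip(ranked, hit_weights):
        if gene in members:
            running += weight / total_hit  # step up, scaled by the gene's score
        else:
            running -= 1.0 / n_miss        # constant step down for a miss
        profile.append(running)
    es = max(profile, key=abs)             # maximum deviation from zero
    return es, profile


# Hypothetical ranked list of six genes.
ranked = [("A", 4), ("B", 3), ("C", 2), ("D", 1), ("E", -1), ("F", -2)]
print(enrichment_score(ranked, {"A", "B"}))  # set concentrated at the top
print(enrichment_score(ranked, {"E", "F"}))  # set concentrated at the bottom
```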
__________________________________________________________________________________________________
GSEA Report
__________________________________________________________________________________________________
• Leading edge analysis identifies the subset of genes that actually contribute to the enrichment score (ES)
• The leading edge subset of a gene set consists of those genes that appear in the ranked list at or before
the point at which the running sum reaches its maximum
• Outputs heat maps and set-to-set overlaps of leading edge subsets between pairs of enriched
gene sets
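Extracting a leading edge subset can be sketched by recomputing the running sum and locating its peak. This is a simplified illustration with hypothetical names and data. The increments follow the same assumption as a weighted running sum with exponent 1, and for a negative ES the sketch takes the members at or after the trough, which goes beyond the positive case described on the slide.

```python
def leading_edge(ranked, gene_set):
    """Return the gene set members that make up the leading edge subset.

    ranked: list of (gene, score) pairs sorted from highest to lowest score
    gene_set: iterable of gene names
    """
    members = set(gene_set)
    total_hit = sum(abs(s) for g, s in ranked if g in members)
    n_miss = sum(1 for g, _ in ranked if g not in members)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, profile = 0.0, []
    for gene, score in ranked:
        running += abs(score) / total_hit if gene in members else -1.0 / n_miss
        profile.append(running)
    # Position of the maximum deviation from zero.
    peak = max(range(len(profile)), key=lambda i: abs(profile[i]))
    if profile[peak] >= 0:
        window = ranked[: peak + 1]  # members at or before the peak
    else:
        window = ranked[peak:]       # assumption: members at or after the trough
    return [gene for gene, _ in window if gene in members]


# Hypothetical ranked list of six genes.
ranked = [("A", 4), ("B", 3), ("C", 2), ("D", 1), ("E", -1), ("F", -2)]
print(leading_edge(ranked, {"A", "B", "F"}))
print(leading_edge(ranked, {"E", "F"}))
```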
__________________________________________________________________________________________________
Interpreting Leading Edge Analysis Results
Panels: Heat Map, Set-to-Set, Gene in Subsets, Histogram
__________________________________________________________________________________________________
Interpreting Leading Edge Analysis Results
• Heat map: shows the (clustered) genes in the leading edge subsets. Expression
values are represented as colors, where the range of colors (red, pink,
light blue, dark blue) shows the range of expression values (high, moderate, low,
lowest)
• Set-to-Set graph: uses color intensity to show the overlap between subsets; the
darker the color, the greater the overlap between the subsets
• Gene in Subsets graph: shows each gene and the number of subsets in which it
appears
• Histogram: the Jaccard index is the intersection divided by the union for a pair of
leading edge subsets. Number of Occurrences is the number of leading edge
subset pairs in a particular bin. In this example, most subset pairs have no overlap
(Jaccard = 0)
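The overlap measure behind the Set-to-Set graph and the histogram, the intersection divided by the union for a pair of leading edge subsets, can be computed as below. This is a minimal sketch; the function names, subset names and genes are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Intersection size divided by union size for two collections of genes."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(subsets):
    """Jaccard index for every pair of named leading edge subsets."""
    return {
        (first, second): jaccard(subsets[first], subsets[second])
        for first, second in combinations(sorted(subsets), 2)
    }


# Hypothetical leading edge subsets from three gene sets.
subsets = {
    "SET_A": ["G1", "G2", "G3"],
    "SET_B": ["G2", "G3", "G4"],
    "SET_C": ["G7", "G8"],
}
for pair, value in pairwise_jaccard(subsets).items():
    print(pair, value)
```

Binning these pairwise values gives the histogram; pairs with no shared genes fall in the bin at 0.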