agenda
Post on 09-Jan-2016
Agenda
1. Biological databases related to microarray
   1. Gene Ontology
   2. KEGG
   3. Biocarta
   4. Reactome
   5. MSigDB
2. Pathway enrichment analysis
   1. GSEA
   2. GSA
   3. Ingenuity Pathway Analysis (IPA)
3. Motif finding
1. Databases
Biological pathways and knowledge are very complex:
Is it possible to establish a database
• to systematically structure and manage the knowledge?
• to validate analysis results or be incorporated into the analysis?
1.1 Gene Ontology
• Ontologies: Controlled vocabularies to describe functions of genes.
• The database is structured as directed acyclic graphs (DAGs), which differ from hierarchical trees in that a 'child' (more specialized term) can have many 'parents' (less specialized terms).
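The difference between a DAG and a tree can be illustrated with a minimal Python sketch. The term names and parent links below are a toy example chosen for illustration (only the "cell proliferation" / "cell cycle" relation is taken from these slides); they are not an excerpt of the real ontology.

```python
# Toy GO-like DAG: each term maps to its list of parents (less specialized terms).
# "root" and "regulation (toy)" are hypothetical placeholder terms.
PARENTS = {
    "root": [],
    "cell proliferation": ["root"],
    "cell cycle": ["cell proliferation"],
    "regulation (toy)": ["root"],
    "cell cycle arrest": ["cell cycle", "regulation (toy)"],  # two parents
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("cell cycle arrest")))
```

In a strict tree every term would have exactly one parent; here "cell cycle arrest" has two, so a gene annotated to it is implicitly annotated to both branches.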
Molecular Function Ontology
the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
Biological Process Ontology
broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component Ontology
subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
Three major categories in Gene Ontology:
Current term counts as of April 2, 2005 at 18:00 Pacific time: 17708 terms, 93.8% with definitions.
• 9263 biological_process
• 1496 cellular_component
• 6949 molecular_function
Evidence codes: how is the information collected?
• IC inferred by curator
• IDA inferred from direct assay
• IEA inferred from electronic annotation
• IEP inferred from expression pattern
• IGI inferred from genetic interaction
• IMP inferred from mutant phenotype
• IPI inferred from physical interaction
• ISS inferred from sequence or structural similarity
• NAS non-traceable author statement
• ND no biological data available
• RCA inferred from reviewed computational analysis
• TAS traceable author statement
• NR not recorded
There may be (a lot of) errors in the database!!
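One way to act on the evidence codes is to filter annotations before building a gene set. The sketch below is a minimal Python illustration: the gene names and annotation triples are invented, and the choice of which codes count as "weak" is an assumption made for the example, not a rule from the GO consortium.

```python
# Hypothetical (gene, GO term, evidence code) annotations; gene names are invented.
ANNOTATIONS = [
    ("geneA", "cell cycle", "IDA"),
    ("geneB", "cell cycle", "IEA"),
    ("geneC", "cell cycle", "TAS"),
    ("geneD", "cell cycle", "ND"),
    ("geneE", "mitosis", "IMP"),
]

# Assumption for this sketch: treat these codes as weaker evidence.
WEAK_CODES = {"IEA", "NAS", "ND", "NR"}


def genes_for_term(term, annotations, exclude=WEAK_CODES):
    """Genes annotated to `term`, skipping annotations with excluded codes."""
    return sorted({gene for gene, go_term, code in annotations
                   if go_term == term and code not in exclude})


print(genes_for_term("cell cycle", ANNOTATIONS))
```

Passing `exclude=set()` keeps every annotation, which makes it easy to compare an enrichment result with and without the weaker evidence.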
Demo:
• Go to GO: http://www.geneontology.org
• Go to "Tools" and click on "AmiGO".
• Click "Browse". Click on the boxes with "+" to expand any category and look at its subcategories. Click on "-" to collapse again.
• Type the term "cell cycle" in the "Search GO" field and press "Submit". You will then see all GO categories containing this term.
• Click on a GO term, say "cell cycle arrest". Genes belonging to this GO term can be shown. Further filter genes by "Data source" or "Species".
• Type the name "cyclin" in AmiGO. Change to the "genes or proteins" selection button and press "Submit". You will then see a number of genes containing this name. Press some of the "Tree view" links.
• Note that in some cases, the same term can appear in different places in the tree. The ontology is thus not strictly hierarchical, but shows complex "many-to-many" relationships between gene products, ontology terms and branches in the ontology tree.
1.2 KEGG: Kyoto Encyclopedia of Genes and Genomes
http://www.genome.jp/kegg/pathway.html
KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks in biological processes (PATHWAY database), the information about the universe of genes and proteins (GENES/SSDB/KO databases), and the information about the universe of chemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases).
The current statistics of the KEGG databases are as follows:
Number of pathways: 23,574 (PATHWAY database)
Number of reference pathways: 265 (PATHWAY database)
Number of ortholog tables: 87 (PATHWAY database)
Number of organisms: 272 (GENOME database)
Number of genes: 911,584 (GENES database)
Number of ortholog clusters: 35,456 (SSDB database)
Number of KO assignments: 6,221 (KO database)
Number of chemical compounds: 12,737 (COMPOUND database)
Number of glycans: 11,017 (GLYCAN database)
Number of chemical reactions: 6,399 (REACTION database)
Number of reactant pairs: 5,953 (RPAIR database)
1.2 KEGG example pathway maps (figures not reproduced): RNA polymerase; cell cycle; Parkinson’s disease. Other disease maps: Alzheimer’s disease, Huntington’s disease, prion disease, ...
1.3 Biocarta
1.4 Reactome
• A manually curated and peer-reviewed (authors, reviewers and editors) pathway database.
• Now annotates 5849 proteins, 4555 complexes, 4827 reactions and 1192 pathways in Homo sapiens (Version 39, 2/21/2012).
Database | # of pathways (gene sets) | Accuracy (manually curated?) | Includes gene-gene interactions (network graphs)? | Note
Gene Ontology | 17708 gene sets (2005) | No (includes many computational predictions) | No |
KEGG | 415 pathways, 951 diseases | Yes | Yes |
Biocarta | 250 pathways, 4000 proteins, 800 complexes and 3000 interactions | Yes | Yes | Cancer focused
Reactome | 1192 pathways (human) | Yes | Yes |
NCI-Nature Pathway Interaction Database (PID) | 59 pathways | Yes | Yes | Curated by Nature editorial team
1.5 MSigDB
A comprehensive pathway database (mainly gene sets without graphical interaction model). Useful for conventional pathway (gene set) enrichment analysis.
C1: Positional gene sets (326)
C2: Curated gene sets (3272)
   Canonical pathways (880)
      Biocarta (217)
      KEGG (186)
      Reactome (430)
C3: Motif gene sets (836)
   miRNA targets gene sets (221)
   TF targets gene sets (615)
C4: Computational gene sets (881)
C5: GO gene sets (1454)
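Because these collections are plain gene sets, they are easy to load for enrichment analysis. MSigDB distributes gene sets as tab-delimited GMT files (set name, description, then the member genes). The sketch below parses that layout in Python; the two sets and their gene symbols are made up for illustration, and `parse_gmt` is a hypothetical helper, not part of any MSigDB tool.

```python
# Two made-up gene sets in GMT layout: name <TAB> description <TAB> gene <TAB> gene ...
GMT_TEXT = (
    "SET_ONE\ttoy description\tG1\tG2\tG3\n"
    "SET_TWO\ttoy description\tG2\tG4\n"
)


def parse_gmt(text):
    """Map each gene set name to the set of its member genes."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _description, *genes = line.split("\t")
        gene_sets[name] = set(genes)
    return gene_sets


gene_sets = parse_gmt(GMT_TEXT)
print(gene_sets["SET_ONE"] & gene_sets["SET_TWO"])  # genes shared by both sets
```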
2. Enrichment analysis
After
1. Selecting DE (differentially expressed) genes, or
2. Classification, or
3. Clustering
we are usually given a gene list for further investigation.
How do we validate the information contained in the gene list against available biological knowledge?
Cell cycle data: Cells are synchronized and samples are taken at various time points (covering 2 cell cycles). 6162 genes are included.
Getting a homogeneous population of cells (figure not reproduced):
• Start with cells at various stages of the cell cycle.
• Synchronization conditions:
   - Temperature shift to 37 C for CDC15 yeast ts-strain
   - Add pheromone
   - Elutriation
• Release back into the cell cycle.
• Take samples as cells progress through the cycle simultaneously.
From Fourier analysis, 800 genes with cyclic gene expression pattern are selected for further investigation.
Are these 800 genes really involved in cell cycle?
http://db.yeastgenome.org/cgi-bin/GO/goTermMapper

Gene group | Related to cell cycle | Annotated but not related to cell cycle | Not annotated | Total
All genes | 385 | 5703 | 74 | 6162
Expression with cyclic pattern | 100 | 691 | 9 | 800
Is the selected set of genes enriched in the GO term of “cell cycle”?
Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088

Is $100/791 = 12.64\%$ (cyclic pattern) significantly different from $285/5297 = 5.38\%$ (other genes)?
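The two proportions being compared can be reproduced with a few lines of Python. This is only a restatement of the arithmetic from the table (annotated genes only); the variable and function names are chosen for this sketch.

```python
# Counts from the 2x2 table: (related to cell cycle, annotated but not related).
other_genes = (285, 5012)
cyclic_genes = (100, 691)


def fraction_related(related, not_related):
    """Fraction of annotated genes in a group that are related to cell cycle."""
    return related / (related + not_related)


cyclic = fraction_related(*cyclic_genes)
other = fraction_related(*other_genes)
print(f"cyclic: {100 * cyclic:.2f}%  other: {100 * other:.2f}%")
```

The raw difference alone does not say whether it could arise by chance, which is what the tests on the following slides address.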
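Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | $N_{11}$ | $N_{12}$ | $N_{1\cdot}$
Expression with cyclic pattern | $N_{21}$ | $N_{22}$ | $N_{2\cdot}$
Total | $N_{\cdot 1}$ | $N_{\cdot 2}$ | $N$

Data are sampled from a population with cell probabilities $p_{11}, p_{12}, p_{21}, p_{22}$.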
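\[ H_0:\; p_{ij} = p_{i\cdot}\, p_{\cdot j} \quad \text{(independence)} \]
\[ H_A:\; p_{ij} \neq p_{i\cdot}\, p_{\cdot j} \quad \text{(dependence; in particular, enrichment)} \]

Under the null hypothesis, with row totals $N_{i\cdot}$, column totals $N_{\cdot j}$ and grand total $N$:
\[ \hat{p}_{i\cdot} = \frac{N_{i\cdot}}{N}, \qquad \hat{p}_{\cdot j} = \frac{N_{\cdot j}}{N}, \qquad \hat{p}_{ij} = \hat{p}_{i\cdot}\, \hat{p}_{\cdot j} = \frac{N_{i\cdot} N_{\cdot j}}{N^2} \]
\[ \chi^2 = \sum_{i,j} \frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}} = \sum_{i,j} \frac{(N_{ij} - N \hat{p}_{ij})^2}{N \hat{p}_{ij}} \sim \chi^2_1 \]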
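Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088

\[ \chi^2 = \frac{\left(285 - \frac{5297 \times 385}{6088}\right)^2}{\frac{5297 \times 385}{6088}} + \frac{\left(5012 - \frac{5297 \times 5703}{6088}\right)^2}{\frac{5297 \times 5703}{6088}} + \frac{\left(100 - \frac{791 \times 385}{6088}\right)^2}{\frac{791 \times 385}{6088}} + \frac{\left(691 - \frac{791 \times 5703}{6088}\right)^2}{\frac{791 \times 5703}{6088}} = 61.2644 \]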
R code for chi-square test without continuity correction:
> chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct=FALSE)
Pearson's Chi-squared test
data: matrix(c(285, 5012, 100, 691), 2, 2)
X-squared = 61.2644, df = 1, p-value = 4.99e-15
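To make the computation behind the R output explicit, here is a minimal Python sketch that computes the Pearson statistic from the expected counts (row total times column total divided by the grand total). It is not a replacement for `chisq.test`; the function name is hypothetical, and the p-value uses the identity P(chi-square with 1 df > x) = erfc(sqrt(x/2)), which holds only for a 2x2 table.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 df, no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Upper tail of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: other genes, cyclic genes. Columns: related, not related to cell cycle.
stat, p_value = chi_square_2x2([[285, 5012], [100, 691]])
print(round(stat, 4), p_value)
```

The R call fills the matrix column by column, so its table is the transpose of the one used here; the Pearson statistic is the same for a table and its transpose.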
Fisher’s exact test: G genes in the genome (G=1663) are analyzed. Consider a functional category “F” containing m genes. If h genes in a cluster of size C are found to be in “F”, the p-value (i.e. the probability of observing h or more annotated genes in the cluster) is calculated as (Tavazoie et al. 1999):
\[ P[X \ge h] = 1 - \sum_{i=0}^{h-1} \frac{\binom{C}{i}\binom{G-C}{m-i}}{\binom{G}{m}} \]
The chi-squared test is an approximate test and may not perform well when the sample size is small. Fisher’s exact test is a better alternative.
 | Inside cluster | Outside cluster | Total
Inside pathway F | h | m-h | m
Outside pathway F | C-h | G-m-C+h | G-m
Total | C | G-C | G

If genes are randomly assigned, the probability of having h intersection genes is
\[ P[X = h] = \frac{\binom{C}{h}\binom{G-C}{m-h}}{\binom{G}{m}} \]
The p-value is the probability of observing h or more intersection genes by chance:
\[ P[X \ge h] = 1 - \sum_{i=0}^{h-1} \frac{\binom{C}{i}\binom{G-C}{m-i}}{\binom{G}{m}} \]
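The hypergeometric tail can be evaluated exactly with integer arithmetic. The sketch below uses Python's `math.comb` and `fractions.Fraction`; the function names and the small numbers in the usage line (G=10, m=4, C=5, h=3) are chosen only for illustration.

```python
from fractions import Fraction
from math import comb


def prob_exactly(i, G, m, C):
    """P[X = i]: i of the m pathway genes fall inside a cluster of size C."""
    return Fraction(comb(C, i) * comb(G - C, m - i), comb(G, m))


def enrichment_p_value(h, G, m, C):
    """P[X >= h] computed as 1 minus the probability of fewer than h hits."""
    return 1 - sum(prob_exactly(i, G, m, C) for i in range(h))


# Toy numbers: 10 genes, pathway of 4, cluster of 5, 3 genes in the intersection.
p = enrichment_p_value(3, G=10, m=4, C=5)
print(p, float(p))
```

Working with `Fraction` avoids floating-point cancellation in `1 - sum(...)`, which matters when the p-value is extremely small.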
Fisher’s exact test

Observation:
 | Inside cluster | Outside cluster | Total
Inside pathway F | 39 | 1 | 40
Outside pathway F | 161 | 1799 | 1960
Total | 200 | 1800 | 2000

• Only two tables are at least as extreme as the observation: the observed table itself (39, 1) and the table below (40, 0).
 | Inside cluster | Outside cluster | Total
Inside pathway F | 40 | 0 | 40
Outside pathway F | 160 | 1800 | 1960
Total | 200 | 1800 | 2000

\[ p\text{-value} = \frac{\binom{200}{39}\binom{1800}{1}}{\binom{2000}{40}} + \frac{\binom{200}{40}\binom{1800}{0}}{\binom{2000}{40}} \approx 0 \]
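For the example above, the two terms of the p-value can be summed exactly. This is a small Python sketch using only the numbers on the slide (G=2000, m=40, C=200); the variable names are chosen for the sketch.

```python
from fractions import Fraction
from math import comb

G, m, C = 2000, 40, 200  # genes in total, genes in pathway F, genes in cluster

# (inside cluster, outside cluster) counts of pathway genes for the two
# tables that are at least as extreme as the observation.
extreme_tables = [(39, 1), (40, 0)]

p_value = sum(
    Fraction(comb(C, inside) * comb(G - C, outside), comb(G, m))
    for inside, outside in extreme_tables
)
print(float(p_value))
```

The result is positive but vanishingly small, which is why the slide reports it as approximately 0.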
Kolmogorov-Smirnov test (KS test)
-- A major issue of Fisher’s exact test is that it requires an ad hoc threshold to generate the DE gene list.
-- The KS test is a better way to associate an ordered gene list with pathway information, because it uses the whole ranking rather than a cutoff.
Example: S1=(1,2,3,5), S2=(4,6,8,9,10)
\[ D = \max_x |F_1(x) - F_2(x)| \]
where $F_1$ and $F_2$ are the empirical distribution functions of S1 and S2.
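The statistic D for the example can be computed directly from the two empirical distribution functions. The Python sketch below is a plain two-sample KS statistic, not the weighted running-sum variant used by GSEA; the function names are hypothetical.

```python
def ecdf(sample):
    """Empirical distribution function: fraction of the sample that is <= x."""
    n = len(sample)
    return lambda x: sum(1 for value in sample if value <= x) / n


def ks_statistic(s1, s2):
    """Largest absolute gap between the two empirical distribution functions."""
    f1, f2 = ecdf(s1), ecdf(s2)
    # Both functions are step functions, so checking the observed points suffices.
    return max(abs(f1(x) - f2(x)) for x in sorted(set(s1) | set(s2)))


d = ks_statistic([1, 2, 3, 5], [4, 6, 8, 9, 10])
print(d)
```

Here the maximum is reached at x=5, where all of S1 but only one of the five values in S2 have been passed, giving D = 1 - 0.2 = 0.8.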
In practice, we need to search through thousands of GO terms to determine which GO terms are enriched in the selected gene set.
Multiple comparison problem!!
Difficulties: Tests are highly dependent.
1. Hierarchical structure of the GO, e.g. “Cell Proliferation” is a parent GO term of “Cell Cycle”.
2. Each gene can belong to multiple GO terms, e.g. the human HoxA7 gene belongs to four GO terms: “Development”, “Nucleus”, “DNA dependent regulation and transcription”, “Transcription factor activity”.
Simple and naïve way:
1. Get p-values from Fisher’s exact test for all pathways.
2. Correct by the Benjamini-Hochberg procedure to control FDR.
Problem:
1. Fisher’s test simplifies DE statistics into a biomarker list (0-1).
2. It does not consider the gene dependence structure or the pathway hierarchical dependence structure.
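Step 2 of the simple approach can be sketched in a few lines of Python. This is the standard Benjamini-Hochberg step-up adjustment; the p-values in the usage example are invented for illustration and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


# Hypothetical Fisher p-values for five pathways.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
q_values = benjamini_hochberg(p_values)
significant = [k for k, q in enumerate(q_values) if q <= 0.05]
print(q_values, significant)
```

With these made-up inputs, only the first two pathways pass an FDR threshold of 0.05. The procedure's guarantee assumes independent or positively dependent tests, which is exactly the concern raised under "Problem" above.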
Improved methods:
1. Use averaged t-statistics or Kolmogorov-Smirnov (KS) statistics as the pathway-specific enrichment score.
2. Apply a permutation test (either gene permutation or sample permutation) to perform FDR control.
3. Read the following papers if interested.
Goeman, J.J. and Buhlmann, P. (2007) Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics, 23, 980-987.
Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S. and Park, P.J. (2005) Discovering statistically significant pathways in expression profiling studies, Proceedings of the National Academy of Sciences of the United States of America, 102, 13544-13549.
Efron, B. and Tibshirani, R. (2007) On testing the significance of sets of genes, Annals of Applied Statistics, 1, 107-129.
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 102, 15545-15550.
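As a rough illustration of the "averaged t-statistic plus gene permutation" idea, here is a minimal Python sketch. It is not the procedure of any of the papers above: the gene-level statistics are invented, the function name is hypothetical, and only gene permutation (not sample permutation) is shown.

```python
import random


def gene_permutation_p_value(stats, gene_set, n_perm=999, seed=0):
    """Compare the mean statistic of gene_set with random sets of equal size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = sum(stats[g] for g in gene_set) / len(gene_set)
    genes = list(stats)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, len(gene_set))
        score = sum(stats[g] for g in random_set) / len(random_set)
        if score >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Invented t-statistics for 100 genes: background values lie in [-1, 1],
# while the five genes of a hypothetical pathway are shifted up to 3.0.
stats = {f"g{k}": ((k % 5) - 2) * 0.5 for k in range(100)}
pathway = ["g0", "g1", "g2", "g3", "g4"]
for gene in pathway:
    stats[gene] = 3.0

p_value = gene_permutation_p_value(stats, pathway)
print(p_value)
```

Gene permutation ignores correlation between genes; sample permutation (shuffling the class labels and recomputing the t-statistics) preserves it, at a higher computational cost.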
Simple Fisher’s exact test:
• Ingenuity Pathway Analysis (IPA): A commercial package with a good interface and human-curated annotation. Can generate network figures.
• NIH DAVID: Free and web-based. Performs enrichment analysis (Fisher’s exact test), adjusts for multiple comparisons and generates a table of results. Uses multiple databases.
• “GOstats” package in Bioconductor: Free R package. Performs enrichment analysis (Fisher’s exact test) and generates a table of results. Uses only the GO database.
More sophisticated and systematic methods:
• Gene set enrichment analysis (GSEA; MIT Mesirov’s group) http://www.broad.mit.edu/gsea/ (free)
• Gene set analysis (GSA; Stanford Tibshirani’s group) http://www-stat.stanford.edu/~tibs/GSA/ (free)
• Ingenuity Pathway Analysis (IPA) http://www.ingenuity.com/ (commercial; Pitt has purchased licenses)
Things to note when using biological database:
1. Biological pathways and gene functions are complex and difficult to quantify.
2. Data may not be accurate. The analysis should take the strength of evidence into account.
3. May need to go to specific database for particular organism. (e.g. SGD for yeast; FlyBase and BDGP for fly)
4. To systematically collect and manage massive biological knowledge from publications and experiments is an important and active research topic in bioinformatics.
3. Motif Finding
http://web.indstate.edu/thcme/mwking/gene-regulation.html

Factor | Sequence Motif | Comments
c-Myc and Max | CACGTG | c-Myc first identified as retroviral oncogene; Max specifically associates with c-Myc in cells
c-Fos and c-Jun | TGAC/GTC/AA | both first identified as retroviral oncogenes; associate in cells, also known as the factor AP-1
CREB | TGACGC/TC/AG/A | binds to the cAMP response element; family of at least 10 factors resulting from different genes or alternative splicing; can form dimers with c-Jun
c-ErbA; also TR (thyroid hormone receptor) | GTGTCAAAGGTCA | first identified as retroviral oncogene; member of the steroid/thyroid hormone receptor superfamily; binds thyroid hormone
c-Ets | G/CA/CGGAA/TGT/C | first identified as retroviral oncogene; predominates in B- and T-cells
GATA | T/AGATA | family of erythroid cell-specific factors, GATA-1 to -6
c-Myb | T/CAACG/TG | first identified as retroviral oncogene; hematopoietic cell-specific factor
MyoD | CAACTGAC | controls muscle differentiation
NF-(kappa)B and c-Rel | GGGAA/CTNT/CCC(1) | both factors identified independently; c-Rel first identified as retroviral oncogene; predominate in B- and T-cells
RAR (retinoic acid receptor) | ACGTCATGACCT | binds to elements termed RAREs (retinoic acid response elements); also binds to c-Jun/c-Fos site
SRF (serum response factor) | GGATGTCCATATTAGGACATCT | exists in many genes that are inducible by the growth factors present in serum
• Genes in a cluster have similar expression patterns.
• They might share common regulatory motifs so they are expressed simultaneously.
• It is of interest to find motifs from the gene clusters.
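A first, much simpler step than de novo motif discovery is to scan the promoters of a cluster for a motif that is already known. The Python sketch below counts occurrences of the c-Myc/Max motif CACGTG from the table above; the promoter sequences and gene names are invented for illustration, and only the given strand is scanned (CACGTG is its own reverse complement, so that is sufficient for this particular motif).

```python
import re

# Hypothetical promoter sequences for genes in one expression cluster.
PROMOTERS = {
    "geneA": "TTACACGTGAACCACGTGTT",
    "geneB": "GGGCACGTGCCC",
    "geneC": "ATATATATATAT",
}
MOTIF = "CACGTG"


def count_sites(sequence, motif=MOTIF):
    """Count motif occurrences; the lookahead also catches overlapping sites."""
    return len(re.findall(f"(?={re.escape(motif)})", sequence))


counts = {gene: count_sites(seq) for gene, seq in PROMOTERS.items()}
print(counts)
```

Whether such counts are more frequent in the cluster than in the rest of the genome can then be assessed with the same enrichment tests used for pathways.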
The following materials are obtained from Shirley Liu at Harvard.