Agenda

1. Biological databases related to microarray
   1. Gene Ontology
   2. KEGG
   3. Biocarta
   4. Reactome
   5. MSigDB

2. Pathway enrichment analysis
   1. GSEA
   2. GSA
   3. Ingenuity Pathway Analysis (IPA)

3. Motif finding

1. Databases

Biological pathways and knowledge are very complex.

Is it possible to establish a database
• to systematically structure and manage the knowledge?
• to validate analysis results, or to be incorporated into the analysis?

1.1 Gene Ontology

• Ontologies: controlled vocabularies to describe functions of genes.
• The database is structured as directed acyclic graphs (DAGs), which differ from hierarchical trees in that a 'child' (more specialized term) can have many 'parents' (less specialized terms).
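The difference between a DAG and a tree can be sketched in a few lines of Python. The term names and edges below are hypothetical placeholders, not actual GO terms; the point is only that one child can list several parents, so walking upward may reach the root by more than one route.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
# These names are placeholders, not real GO terms.
PARENTS = {
    "root": [],
    "parent_a": ["root"],
    "parent_b": ["root"],
    "child": ["parent_a", "parent_b"],  # two parents: not possible in a tree
}

def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:  # a term can be reached by several paths
            found.add(current)
            stack.extend(parents[current])
    return found

print(sorted(ancestors("child", PARENTS)))
```

A gene annotated to "child" is implicitly annotated to all three ancestors, which is one reason tests on related terms are not independent.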

Molecular Function Ontology

the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity

Biological Process Ontology

broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions

Cellular Component Ontology

subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex

Three major categories in Gene Ontology:

Current term counts as of April 2, 2005 at 18:00 Pacific time: 17708 terms, 93.8% with definitions.

• 9263 biological_process
• 1496 cellular_component
• 6949 molecular_function

Evidence codes: how is the information collected?

• IC: inferred by curator
• IDA: inferred from direct assay
• IEA: inferred from electronic annotation
• IEP: inferred from expression pattern
• IGI: inferred from genetic interaction
• IMP: inferred from mutant phenotype
• IPI: inferred from physical interaction
• ISS: inferred from sequence or structural similarity
• NAS: non-traceable author statement
• ND: no biological data available
• RCA: inferred from reviewed computational analysis
• TAS: traceable author statement
• NR: not recorded

There may be (a lot of) errors in the database!!

1.1 Gene Ontology

• Go to GO: http://www.geneontology.org
• Go to "Tools" and click on "AmiGO".
• Click "Browse". Click on the boxes with "+" to expand any category to look at its subcategories. Click on "-" to collapse again.
• Type the term "cell cycle" in the "Search GO" field. Press "Submit". You will then see all GO categories containing this word.
• Click on a GO term, say "cell cycle arrest". Genes belonging to this GO term can be shown. Further filter genes by "Data source" or "Species".
• Type the name "cyclin" in AmiGO. Change to the "genes or proteins" selection button and press "Submit". You will then see a number of genes containing this name. Press some of the "Tree view" links.
• Note that in some cases, the same term category can exist in different places in the tree. This ontology is thus not strictly hierarchical, but shows complex "many-to-many" relationships between gene products, ontology terms and branches in the ontology tree.

1.2 KEGG: Kyoto Encyclopedia of Genes and Genomes

http://www.genome.jp/kegg/pathway.html

KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks in biological processes (PATHWAY database), the information about the universe of genes and proteins (GENES/SSDB/KO databases), and the information about the universe of chemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases).

The current statistics of the KEGG databases are as follows:

Statistic | Count | Database
Number of pathways | 23,574 | PATHWAY
Number of reference pathways | 265 | PATHWAY
Number of ortholog tables | 87 | PATHWAY
Number of organisms | 272 | GENOME
Number of genes | 911,584 | GENES
Number of ortholog clusters | 35,456 | SSDB
Number of KO assignments | 6,221 | KO
Number of chemical compounds | 12,737 | COMPOUND
Number of glycans | 11,017 | GLYCAN
Number of chemical reactions | 6,399 | REACTION
Number of reactant pairs | 5,953 | RPAIR

[Figures: KEGG pathway maps for RNA polymerase, cell cycle, and Parkinson's disease]

Other disease pathways: Alzheimer's disease, Huntington's disease, prion disease, ...

1.3 Biocarta

1.4 Reactome

• A manually curated and peer-reviewed (authors, reviewers and editors) pathway database.
• Now annotates 5849 proteins, 4555 complexes, 4827 reactions and 1192 pathways in Homo sapiens (Version 39, 2/21/2012).

Database | # of pathways (gene sets) | Accuracy (manually curated?) | Includes gene-gene interactions (network graphs)? | Notes
Gene Ontology | 17708 gene sets (2005) | No (includes many computational predictions) | |
KEGG | 415 pathways, 951 diseases | Yes | Yes |
Biocarta | 250 pathways, 4000 proteins, 800 complexes and 3000 interactions | Yes | Yes | Cancer focused
Reactome | 1192 pathways (human) | Yes | Yes |
NCI-Nature Pathway Interaction Database (PID) | 59 pathways | Yes | Yes | Curated by Nature editorial team

1.5 MSigDB

A comprehensive pathway database (mainly gene sets without graphical interaction model). Useful for conventional pathway (gene set) enrichment analysis.

• C1: Positional gene sets (326)
• C2: Curated gene sets (3272)
   • Canonical pathways (880)
      • Biocarta (217)
      • KEGG (186)
      • Reactome (430)
• C3: Motif gene sets (836)
   • miRNA targets gene sets (221)
   • TF targets gene sets (615)
• C4: Computational gene sets (881)
• C5: GO gene sets (1454)

2. Enrichment analysis

After
1. selecting DE genes, or
2. classification, or
3. clustering,

we are usually given a gene list for further investigation.

How do we validate information contained in the gene list by available biological knowledge?

Cell cycle data: Cells are synchronized and samples taken at various time points (covering 2 cell cycles). 6162 genes are included.

[Figure: Getting a homogeneous population of cells]
• Cells at various stages of the cell cycle
• Synchronization conditions: temperature shift to 37 C for the CDC15 yeast ts-strain; add pheromone; elutriation
• Release back into the cell cycle
• Take samples as cells progress through the cycle simultaneously

From Fourier analysis, 800 genes with cyclic gene expression pattern are selected for further investigation.

Are these 800 genes really involved in cell cycle?

http://db.yeastgenome.org/cgi-bin/GO/goTermMapper

Gene group | Related to cell cycle | Annotated but not related to cell cycle | Not annotated | Total
All genes | 385 | 5703 | 74 | 6162
Expression with cyclic pattern | 100 | 691 | 9 | 800

Is the selected set of genes enriched in the GO term of “cell cycle”?

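Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088

Proportion related to cell cycle: 100/791 = 12.64% (cyclic pattern) vs. 285/5297 = 5.38% (other genes). Is the difference significant?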

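Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | $N_{11}$ | $N_{12}$ | $N_{1\cdot}$
Expression with cyclic pattern | $N_{21}$ | $N_{22}$ | $N_{2\cdot}$
Total | $N_{\cdot 1}$ | $N_{\cdot 2}$ | $N$

Data are sampled from cell probabilities $p_{11}, p_{12}, p_{21}, p_{22}$.

$H_0: p_{ij} = p_{i\cdot}\, p_{\cdot j}$ (independence)

$H_A: p_{ij} \neq p_{i\cdot}\, p_{\cdot j}$ (in particular, the selected genes are enriched in the category)

Under the null hypothesis, $\hat{p}_{ij} = \hat{p}_{i\cdot}\, \hat{p}_{\cdot j}$ with $\hat{p}_{i\cdot} = N_{i\cdot}/N$ and $\hat{p}_{\cdot j} = N_{\cdot j}/N$, so the expected count in cell $(i,j)$ is $N_{i\cdot} N_{\cdot j} / N$ and

$$\chi^2 = \sum_{i,j} \frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}}$$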

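Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088

$$\chi^2 = \frac{\left(285 - \frac{5297 \times 385}{6088}\right)^2}{\frac{5297 \times 385}{6088}} + \frac{\left(5012 - \frac{5297 \times 5703}{6088}\right)^2}{\frac{5297 \times 5703}{6088}} + \frac{\left(100 - \frac{791 \times 385}{6088}\right)^2}{\frac{791 \times 385}{6088}} + \frac{\left(691 - \frac{791 \times 5703}{6088}\right)^2}{\frac{791 \times 5703}{6088}} = 61.2644$$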

R code for chi-square test without continuity correction:

> chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct=FALSE)

        Pearson's Chi-squared test

data:  matrix(c(285, 5012, 100, 691), 2, 2)
X-squared = 61.2644, df = 1, p-value = 4.99e-15
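The same statistic can be computed by hand, which makes the observed-versus-expected logic explicit. The sketch below is in Python rather than R and uses only the standard library; the function name is illustrative. For one degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x/2)), which is what the p-value line relies on.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With df = 1, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: other genes, cyclic genes. Columns: cell cycle, not cell cycle.
stat, p_value = chi_square_2x2([[285, 5012], [100, 691]])
print(round(stat, 4), p_value)
```

The result should agree with the R output above up to rounding.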

Fisher's exact test

The chi-squared test is an approximate test and may not perform well when the sample size is small. Fisher's exact test is a better alternative.

G genes in the genome (G=1663) are analyzed, and m of them belong to a functional category "F". In a cluster of size C, h genes are found to be in category "F". The p-value, i.e. the probability of observing h or more annotated genes in the cluster, is calculated as follows (Tavazoie et al. 1999).

Category | Inside cluster | Outside cluster | Total
Inside pathway F | h | m-h | m
Outside pathway F | C-h | G-m-C+h | G-m
Total | C | G-C | G

If genes are randomly assigned, the probability of having h intersection genes is

$$P(h) = \frac{\binom{C}{h}\binom{G-C}{m-h}}{\binom{G}{m}}$$

The p-value is the probability of observing h or more intersection genes by chance:

$$p = \sum_{i=h}^{\min(C,m)} \frac{\binom{C}{i}\binom{G-C}{m-i}}{\binom{G}{m}}$$

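Fisher's exact test

Observation:

Category | Inside cluster | Outside cluster | Total
Inside pathway F | 39 | 1 | 40
Outside pathway F | 161 | 1799 | 1960
Total | 200 | 1800 | 2000

• Only two tables are at least as extreme as the observation: the observed table itself (h = 39) and the table with h = 40:

Category | Inside cluster | Outside cluster | Total
Inside pathway F | 40 | 0 | 40
Outside pathway F | 160 | 1800 | 1960
Total | 200 | 1800 | 2000

$$\text{p-value} = \frac{\binom{200}{39}\binom{1800}{1} + \binom{200}{40}\binom{1800}{0}}{\binom{2000}{40}} \approx 0$$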
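As a minimal sketch, the one-sided Fisher (hypergeometric tail) p-value can be computed directly with exact integer binomial coefficients. The function name is illustrative; the arguments follow the slide notation (G genes in total, m in pathway F, cluster size C, h in both).

```python
from math import comb

def fisher_enrichment_pvalue(G, m, C, h):
    """P(h or more pathway genes fall inside the cluster) under random assignment."""
    total = comb(G, m)
    numerator = sum(
        comb(C, i) * comb(G - C, m - i)  # i pathway genes inside, m - i outside
        for i in range(h, min(C, m) + 1)
    )
    return numerator / total

# Example from the table: G = 2000, m = 40, C = 200, h = 39.
p = fisher_enrichment_pvalue(2000, 40, 200, 39)
print(p)
```

For this example only the terms i = 39 and i = 40 contribute, and the result is effectively zero. Python integers are exact, so no overflow handling is needed for tables of this size.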

Kolmogorov-Smirnov test (KS test)

-- A major issue of Fisher's exact test is that it requires an ad hoc threshold to generate the DE gene list.

-- The KS test avoids the threshold: it can associate any gene ordering with pathway information.

Example: S1=(1,2,3,5), S2=(4,6,8,9,10)

$D = \max_x |F_1(x) - F_2(x)|$, where $F_1$ and $F_2$ are the empirical cumulative distribution functions of S1 and S2.
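A short sketch of the two-sample KS statistic for the example above, assuming S1 and S2 are the values (for instance, ranks) of genes inside and outside a pathway. The helper names are illustrative, and only the statistic D is computed, not its p-value.

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of the sample that is <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)

def ks_statistic(s1, s2):
    """Largest absolute gap between the two empirical CDFs."""
    # The gap can only change at observed values, so checking those suffices.
    points = sorted(set(s1) | set(s2))
    return max(abs(ecdf(s1, x) - ecdf(s2, x)) for x in points)

d = ks_statistic([1, 2, 3, 5], [4, 6, 8, 9, 10])
print(d)
```

Here the largest gap occurs at x = 5, where F1 = 4/4 and F2 = 1/5, giving D = 0.8.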

In practice, we need to search through thousands of GO terms to determine which GO term is enriched in the selected gene set.

Multiple comparison problem!!

Difficulties: Tests are highly dependent.

1. Hierarchical structure of the GO, e.g. "Cell Proliferation" is a parent GO term of "Cell Cycle".
2. Each gene can belong to multiple GO terms, e.g. the human HoxA7 gene belongs to four GO terms: "Development", "Nucleus", "DNA dependent regulation and transcription", "Transcription factor activity".

Simple and naïve way:
1. Get p-values from Fisher's exact test for all pathways.
2. Correct by the Benjamini-Hochberg procedure to control FDR.
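Step 2 can be sketched as follows: a plain-Python version of the Benjamini-Hochberg adjustment, written here for illustration (in R, `p.adjust(p, method="BH")` performs the same adjustment). The input p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Made-up p-values for four pathways.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q_values)
```

Pathways whose adjusted value falls below the chosen FDR level (say 0.05) are reported as enriched.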

Problems:
1. Fisher's test simplifies the DE statistics into a biomarker list (0-1).
2. It does not consider the gene dependence structure or the hierarchical dependence structure among pathways.

Improved methods:
1. Use averaged t-statistics or Kolmogorov-Smirnov (KS) statistics as the pathway-specific enrichment score.
2. Apply a permutation test (either gene permutation or sample permutation) to perform FDR control.
3. Read the following papers if interested.
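A minimal sketch of the gene-permutation idea, assuming the enrichment score is the mean per-gene statistic in the pathway. It is not the exact procedure of any of the cited methods; the gene names, scores, and function name are invented for illustration.

```python
import random

def gene_permutation_pvalue(scores, gene_set, n_perm=2000, seed=0):
    """Compare a pathway's mean score with random gene sets of equal size."""
    rng = random.Random(seed)
    genes = list(scores)
    members = [g for g in genes if g in gene_set]
    k = len(members)
    observed = sum(scores[g] for g in members) / k
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, k)  # gene permutation: random set, same size
        if sum(scores[g] for g in random_set) / k >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return observed, (hits + 1) / (n_perm + 1)

# Invented data: 100 genes, the first 10 with a large statistic.
scores = {f"g{i}": (5.0 if i < 10 else 0.0) for i in range(100)}
pathway = {f"g{i}" for i in range(10)}
observed, p_perm = gene_permutation_pvalue(scores, pathway)
print(observed, p_perm)
```

Sample permutation instead shuffles the class labels and recomputes every gene statistic, which preserves gene-gene correlation but costs more computation.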

Goeman, J.J. and Bühlmann, P. (2007) Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics, 23, 980-987.

Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S. and Park, P.J. (2005) Discovering statistically significant pathways in expression profiling studies, Proceedings of the National Academy of Sciences of the United States of America, 102, 13544-13549.

Efron, B. and Tibshirani, R. (2007) On testing the significance of sets of genes, Annals of Applied Statistics, 1, 107-129.

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 102, 15545-15550.

Simple Fisher's exact test:

• Ingenuity Pathway Analysis: a commercial package with a good interface and human-curated annotation. Can generate network figures.

• NIH DAVID: free and web-based. Performs enrichment analysis (Fisher's exact test), adjusts for multiple comparisons and generates a table of results. Uses multiple databases.

• "GOstats" package in Bioconductor: free R package. Performs enrichment analysis (Fisher's exact test) and generates a table of results. Uses only the GO database.

More sophisticated and systematic methods:
• Gene set enrichment analysis (GSEA; MIT Mesirov's group) http://www.broad.mit.edu/gsea/ (free)
• Gene set analysis (GSA; Stanford Tibshirani's group) http://www-stat.stanford.edu/~tibs/GSA/ (free)
• Ingenuity Pathway Analysis (IPA) http://www.ingenuity.com/ (commercial; Pitt has purchased licenses)

Things to note when using biological database:

1. Biological pathways and gene functions are complex and difficult to quantify.

2. Data may not be accurate. The analysis should take the strength of evidence into account.

3. May need to go to specific database for particular organism. (e.g. SGD for yeast; FlyBase and BDGP for fly)

4. To systematically collect and manage massive biological knowledge from publications and experiments is an important and active research topic in bioinformatics.

3. Motif Finding

http://web.indstate.edu/thcme/mwking/gene-regulation.html

Factor | Sequence Motif | Comments
c-Myc and Max | CACGTG | c-Myc first identified as retroviral oncogene; Max specifically associates with c-Myc in cells
c-Fos and c-Jun | TGAC/GTC/AA | both first identified as retroviral oncogenes; associate in cells, also known as the factor AP-1
CREB | TGACGC/TC/A | binds to the cAMP response element; family of at least 10 factors resulting from different genes or alternative splicing; can form dimers with c-Jun
c-ErbA; also TR (thyroid hormone receptor) | GTGTCAAAGGTCA | first identified as retroviral oncogene; member of the steroid/thyroid hormone receptor superfamily; binds thyroid hormone
c-Ets | G/CA/CGGAA/TGT/C | first identified as retroviral oncogene; predominates in B- and T-cells
GATA | T/AGATA | family of erythroid cell-specific factors, GATA-1 to -6
c-Myb | T/CAACG/TG | first identified as retroviral oncogene; hematopoietic cell-specific factor
MyoD | CAACTGAC | controls muscle differentiation
NF-(kappa)B and c-Rel | GGGAA/CTNT/CCC(1) | both factors identified independently; c-Rel first identified as retroviral oncogene; predominate in B- and T-cells
RAR (retinoic acid receptor) | ACGTCATGACCT | binds to elements termed RAREs (retinoic acid response elements); also binds to c-Jun/c-Fos site
SRF (serum response factor) | GGATGTCCATATTAGGACATCT | exists in many genes that are inducible by the growth factors present in serum

• Genes in a cluster have similar expression patterns.

• They might share common regulatory motifs so they are expressed simultaneously.

• It is of interest to find motifs from the gene clusters.
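As a very small illustration of the idea, one can count how often a known motif occurs in the upstream sequences of the genes in a cluster. This only scans for a given motif (the c-Myc/Max site CACGTG from the table above); it is not de novo motif discovery. The sequences, gene names, and function name are invented for the example.

```python
import re

def count_motif(sequences, motif="CACGTG"):
    """Count (possibly overlapping) exact occurrences of a motif per sequence."""
    # A lookahead matches at every start position, so overlaps are counted.
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    return {name: len(pattern.findall(seq)) for name, seq in sequences.items()}

# Invented upstream sequences for two genes in one cluster.
upstream = {
    "geneA": "TTCACGTGAACACGTG",
    "geneB": "AAAATTTTCCCC",
}
counts = count_motif(upstream)
print(counts)
```

A motif that appears in most genes of a cluster but rarely in the rest of the genome is a candidate shared regulatory element.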

The following materials are obtained from Shirley Liu at Harvard.
