
Page 1: Carlo Colantuoni & Rafael Irizarry April 19, 2006 ccolantu@jhsph.edu Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor

Carlo Colantuoni & Rafael Irizarry

April 19, 2006

ccolantu@jhsph.edu

Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor

Page 2:

Biological Setup

Every cell in the human body contains the entire human genome: 3.3 Gb or ~30K genes.

The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes.

Tasks necessary for gene expression analysis:

Define what a gene is.

Identify genes in a sea of genomic DNA where <3% of DNA is contained in genes.

Design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously. Cross-reference these probes.

Page 3:

Cellular Biology, Gene Expression, and Microarray Analysis

DNA -> RNA -> Protein

Page 4:

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

[Figure: genomic DNA (3.3 Gb, ~30K genes) and its mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA tail), with a DNA probe]

Sequence is a Necessity

Page 5:

From Genomic DNA to mRNA Transcripts

EXONS INTRONS

RNA editing & SNPs

Alternative splicing

Alternative start & stop sites in same RNA molecule

~30K

>30K

Transcript coverage

Homology to other transcripts

Hybridization dynamics

3’ bias

Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.

Page 6:

Sequence Quality!

Redundancy!

Completeness?

Unsurpassed as source of expressed sequence

Chaos?!?

Page 7:

From Genomic DNA to mRNA Transcripts

~30K

>30K

>>30K

Page 8:

Transcript-Based Gene-Centered Information

Page 9:

Possible mis-referencing:
-- Genomic GenBank Acc.#’s
-- Referenced ID has more NT’s than probe
-- Old DB builds
-- DB or table errors – copying and pasting 30K rows in Excel …

Using RefSeqs can help.

Design of Gene Expression Probes

Content: UniGene, Incyte, Celera

Expressed vs. Genomic

Source: cDNA libraries, clone collections, oligos

Cross-referencing of array probes (across platforms):

Sequence <> GenBank <> UniGene <> HomoloGene
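
The cross-referencing chain above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used in the talk: the platform names, probe IDs, accession numbers, and cluster IDs are all made up, and real tables would come from array annotation files and NCBI downloads. Probes from two platforms are paired when they resolve to the same UniGene cluster (same gene, same species) or to the same HomoloGene group (homologous genes across species).

```python
# All identifiers below are hypothetical placeholders, not real database IDs.
PROBE_TO_GENBANK = {
    "platformA": {"A_001": "ACC0001", "A_002": "ACC0002", "A_003": "ACC0009"},
    "platformB": {"B_101": "ACC0003", "B_102": "ACC0004"},
}
GENBANK_TO_UNIGENE = {
    "ACC0001": "Hs.1",
    "ACC0002": "Hs.2",
    "ACC0003": "Hs.1",
    "ACC0004": "Mm.7",
}
UNIGENE_TO_HOMOLOGENE = {"Hs.1": "HG100", "Hs.2": "HG200", "Mm.7": "HG200"}


def resolve(platform, probe_id):
    """Follow probe -> GenBank -> UniGene -> HomoloGene; a broken link yields None."""
    accession = PROBE_TO_GENBANK[platform].get(probe_id)
    unigene = GENBANK_TO_UNIGENE.get(accession)
    homologene = UNIGENE_TO_HOMOLOGENE.get(unigene)
    return {"genbank": accession, "unigene": unigene, "homologene": homologene}


def match_probes(platform_a, platform_b, level):
    """Pair probes from two platforms that share an ID at the given level."""
    index = {}
    for probe in PROBE_TO_GENBANK[platform_b]:
        key = resolve(platform_b, probe)[level]
        if key is not None:
            index.setdefault(key, []).append(probe)
    pairs = []
    for probe in PROBE_TO_GENBANK[platform_a]:
        key = resolve(platform_a, probe)[level]
        for other in index.get(key, []):
            pairs.append((probe, other, key))
    return sorted(pairs)


print(match_probes("platformA", "platformB", "unigene"))
print(match_probes("platformA", "platformB", "homologene"))
```

In this toy data, probe A_003 points at an accession that is missing from the UniGene table, so it silently drops out of every match. That is the kind of mis-referencing the slide warns about, and it is worth counting unresolved probes before trusting a cross-platform comparison.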

Page 10:

From Genomic DNA to mRNA Transcripts

Page 13:

From Genomic DNA to mRNA Transcripts

Page 15:

http://www.ncbi.nlm.nih.gov/Entrez/

Page 16:

Functional Annotation of Lists of Genes

KEGG

PFAM

SWISS-PROT

GO

DRAGON

DAVID

BioConductor

Page 17:

Analysis of Functional Gene Groups

Page 24:

• One of the largest challenges in analyzing genomic data is associating the experimental data with the available metadata, e.g. sequence, gene annotation, chromosomal maps, literature.

• The annotate and AnnBuilder packages provide some tools for carrying this out.

• Using AnnBuilder, it is possible to build associations with specific gene lists, e.g. the hgu95a package for Affymetrix HGU95A GeneChips.

• The annotate package maps to GenBank accession number, LocusLink LocusID, gene symbol, gene name, UniGene cluster, chromosome, cytoband, physical distance (bp), orientation, Gene Ontology Consortium (GO), and PubMed PMID.
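
The kind of lookup these packages support can be illustrated with a rough Python sketch. It does not use or reproduce the annotate API (annotate is an R package); the annotation table, probe IDs, gene symbols, and GO labels are hypothetical placeholders. The sketch shows two operations: pulling one annotation field for a list of probes, and inverting the probe-to-GO mapping so that probes are grouped by functional category.

```python
# Hypothetical annotation table keyed by probe ID; every value is a placeholder.
ANNOTATION = {
    "probe_1": {"symbol": "GENE_A", "chromosome": "1", "go": ["GO:A", "GO:B"]},
    "probe_2": {"symbol": "GENE_B", "chromosome": "X", "go": ["GO:B"]},
    "probe_3": {"symbol": "GENE_C", "chromosome": "7", "go": []},
}


def lookup(probes, field):
    """Return {probe: value} for one annotation field; unknown probes map to None."""
    return {probe: ANNOTATION.get(probe, {}).get(field) for probe in probes}


def group_by_go(probes):
    """Invert the probe -> GO mapping into GO term -> list of probes."""
    groups = {}
    for probe in probes:
        for term in ANNOTATION.get(probe, {}).get("go", []):
            groups.setdefault(term, []).append(probe)
    return groups


print(lookup(["probe_1", "probe_9"], "symbol"))
print(group_by_go(["probe_1", "probe_2", "probe_3"]))
```

Returning None for an unknown probe, rather than raising an error, keeps a long gene list from failing on a single missing entry, at the cost of having to check for gaps afterwards.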

Page 25:

Analysis of Functional Gene Groups

Page 26:

Functional Gene/Protein Networks

DIP

BIND

MINT

HPRD

PubGene

Predicted Protein Interactions

Page 27:

Analysis of Gene Networks

Page 29:

9606 is the Taxonomy ID for Homo sapiens

Page 33:

Predicted Human Protein Interactions

Page 34:

Predicted Human Protein Interactions

Used high-throughput protein interaction experiments from fly, worm, and yeast to predict human protein interactions.

A human protein interaction is predicted if both proteins in an interaction pair from another organism have high sequence homology to human proteins.

>70K Hs interactions predicted

>6K Hs genes
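
The prediction rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline behind the numbers on the slide: the protein names, the homology scores, and the `min_score` cutoff are all hypothetical, and the homology search itself (for example a sequence similarity search against human proteins) is assumed to have been run beforehand.

```python
# Hypothetical interaction pairs observed in model organisms.
MODEL_INTERACTIONS = [
    ("fly_p1", "fly_p2"),
    ("yeast_p1", "yeast_p2"),
    ("worm_p1", "worm_p2"),
]

# Hypothetical homology hits: model protein -> [(human protein, similarity score)].
HUMAN_HITS = {
    "fly_p1": [("HS_A", 0.82)],
    "fly_p2": [("HS_B", 0.75), ("HS_C", 0.31)],
    "yeast_p1": [("HS_D", 0.64)],
    "yeast_p2": [("HS_E", 0.58)],
    "worm_p1": [("HS_A", 0.90)],  # worm_p2 has no human hit at all
}


def predict_human_interactions(interactions, hits, min_score):
    """Transfer an interaction to human only if BOTH partners have a strong human homolog."""
    predicted = set()
    for first, second in interactions:
        humans_first = [h for h, score in hits.get(first, []) if score >= min_score]
        humans_second = [h for h, score in hits.get(second, []) if score >= min_score]
        for a in humans_first:
            for b in humans_second:
                predicted.add(tuple(sorted((a, b))))  # store each pair in one order
    return predicted


print(sorted(predict_human_interactions(MODEL_INTERACTIONS, HUMAN_HITS, 0.5)))
```

The worm pair contributes nothing because only one of its two partners has a human homolog, which is exactly the "both proteins" condition in the rule. Lowering `min_score` adds more predictions (here the weak HS_C hit) at the price of less reliable homology.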

Page 35:

Analysis of Gene Networks

Page 36:

Carlo Colantuoni

Clinical Brain Disorders Branch, NIMH, NIH

Dept. Biostatistics

ccolantu@jhsph.edu

Thanks to …

Rafael Irizarry

Scott Zeger

Jonathan Pevsner

Page 38:

NCBI Web Links

http://www.ncbi.nlm.nih.gov
http://www.ncbi.nlm.nih.gov/Entrez/
http://www.ncbi.nih.gov/Genbank/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
http://www.ncbi.nlm.nih.gov/LocusLink/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
http://www.ncbi.nlm.nih.gov/PubMed/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp
http://www.ncbi.nlm.nih.gov/SNP/
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
http://www.ncbi.nlm.nih.gov/geo/
http://www.ncbi.nlm.nih.gov/RefSeq/

FTP:
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/repository/UniGene
ftp://ftp.ncbi.nih.gov/pub/HomoloGene/

Page 39:

NUCLEOTIDE:

http://genome.ucsc.edu/

http://www.embl-heidelberg.de/

http://www.ensembl.org/

http://www.ebi.ac.uk/

http://www.gdb.org/

http://bioinfo.weizmann.ac.il/cards/index.html

http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl

PATHWAYS and NETWORKS:

http://www.genome.ad.jp/kegg/

ftp://ftp.genome.ad.jp/pub/kegg/ (http://www.genome.ad.jp/anonftp/)

http://dip.doe-mbi.ucla.edu

http://dip.doe-mbi.ucla.edu/dip/Download.cgi

http://www.blueprint.org/bind/

http://www.blueprint.org/bind/bind_downloads.html

http://160.80.34.4/mint/index.php

http://160.80.34.4/mint/release/main.php

http://www.hprd.org/

http://www.hprd.org/FAQ?selectedtab=DOWNLOAD+REQUESTS

http://www.pubgene.org/ (also .com)

PROTEIN:

http://us.expasy.org/

ftp://us.expasy.org/

http://www.sanger.ac.uk/Software/Pfam/

http://www.sanger.ac.uk/Software/Pfam/ftp.shtml

http://smart.embl-heidelberg.de/

http://www.ebi.ac.uk/interpro/

http://us.expasy.org/prosite/

ftp://us.expasy.org/databases/prosite/

More Web Links

http://www.bioconductor.org/

http://apps1.niaid.nih.gov/david/

http://www.geneontology.org/

http://discover.nci.nih.gov/gominer/index.jsp

http://pubmatrix.grc.nia.nih.gov/

http://pevsnerlab.kennedykrieger.org/dragon.htm

Page 40:

SAVAGE:

Detection of More Subtle Functionally Related Groups of Gene Expression Changes

Page 41:

Differential Expression of Functional Gene Groups within One Experiment

[Figure: EXP#1, Swiss-Prot, PFAM, KEGG, DRAGON, SAVAGE; figure labels: 30K, ~3K, 10K, ~40K annotations]

Page 42:

Differential Expression of a Single Functional Gene Group Across Multiple Experiments

[Figure: EXP#1, EXP#2, EXP#3, EXP#4, BioDB, DRAGON, SAVAGE]

Page 43:

Similar Differential Expression Patterns Across Multiple Experiments

[Figure: plot with a p value scale (0.0 to <0.1) and point labels ALL, CN, X, and 1-5 for EXP#1 and EXP#2]

The distribution of gene expression values for each gene group in each sample is plotted as a single point in low-dimensional space. This is achieved using Principal Components Analysis along with Non-Metric Multi-Dimensional Scaling.
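
One way to turn a distribution into a single plotted point is sketched below in plain Python. The slide does not say how each distribution is summarized, so the quantile profile used here is an assumption, as are the function names and the toy data. Only the Principal Components Analysis step is shown (via power iteration); the Non-Metric Multi-Dimensional Scaling step is omitted.

```python
import math


def quantile_profile(values, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Summarize a distribution as a fixed-length vector of interpolated quantiles."""
    ordered = sorted(values)
    last = len(ordered) - 1
    profile = []
    for p in probs:
        pos = p * last
        lo = math.floor(pos)
        hi = min(lo + 1, last)
        profile.append(ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo]))
    return profile


def leading_axis(cov, previous, iterations=200, tol=1e-9):
    """Power iteration for the top eigenvector of cov, kept orthogonal to earlier axes."""
    n = len(cov)
    trace = sum(cov[i][i] for i in range(n))
    vec = [float(i + 1) for i in range(n)]  # fixed, asymmetric starting vector
    for _ in range(iterations):
        for axis in previous:  # remove the part lying along already-found axes
            dot = sum(v * a for v, a in zip(vec, axis))
            vec = [v - dot * a for v, a in zip(vec, axis)]
        vec = [sum(cov[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm <= tol * trace:  # no variance left in any remaining direction
            return [0.0] * n
        vec = [v / norm for v in vec]
    return vec


def pca_2d(rows):
    """Project rows (at least two, equal length) onto their first two principal axes."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    axes = []
    for _ in range(2):
        axes.append(leading_axis(cov, axes))
    return [[sum(c[j] * a[j] for j in range(d)) for a in axes] for c in centered]


# Toy data: one gene group measured in three samples, each shifted by one unit.
profiles = [quantile_profile([v + shift for v in range(10)]) for shift in (0, 1, 2)]
for point in pca_2d(profiles):
    print(point)
```

Because the three toy distributions differ only by a shift, all of the variation falls on the first axis and the second coordinate is zero. Real gene groups would differ in spread and shape as well, which is what the second axis would pick up.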

Page 44:

PING:

Detection of Differential Expression in Functional Networks of Proteins

Page 45:

Interaction Networks in Gene Expression Data

Page 46:

Large Protein Interaction Network

Network Regulated in Sample #1

Page 47:

Network Regulated in Sample #1

Network Regulated in Sample #2

Large Protein Interaction Network

Page 48:

Network Regulated in Sample #1

Network Regulated in Sample #2

Network Regulated in Sample #3

Large Protein Interaction Network

Page 49:

Network of Interest

Network Regulated in Sample #1

Network Regulated in Sample #2

Network Regulated in Sample #3

Large Protein Interaction Network

PING

Page 50:

[Figure: NT's in GenBank (millions), on a log scale from 1 to 100000, plotted for 1984 to 2004]

Page 51:

Genomic DNA Content

1. Interspersed repeats (~1/2 Hs. genome)
2. (Processed) pseudogenes
3. Simple sequence repeats
4. Segmental duplications (~5% Hs. genome)
5. Blocks of tandem repeats (can be very large)
6. Genes: Promoters - Exons – Introns <3%

defining what a gene is - protein coding unit of genomic DNA with an mRNA intermediate

identifying genes within genomic DNA

protein-coding genes (mRNA)

functional RNA genes - tRNA, rRNA, snoRNA, snRNA, miRNA

prokaryotes eukaryotes

Page 52:

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

[Figure: genomic DNA (3.3 Gb), its mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA tail), and the resulting protein]

Page 53:

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

[Figure: genomic DNA (3.3 Gb, ~30K genes), its mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA tail), and the resulting protein]

Sequence is a Necessity

Page 54:

How is a gene defined in “wet” biology and in silico?

Seq. from mRNA sample

Seq. on array

Array probe design:

Source – cDNA libraries, oligos, clone collections

Content – UniGene, Celera, Incyte

Transcript coverage

Homology to other transcripts

Hybridization dynamics – hyper-multiplex hyb rxn

Empirical validation

3’ bias

Alt. splicing - known and not

Alt. start / stop site in same RNA molecule

Less important: RNA editing, SNPs

Cross-referencing of array probes:
GenBank <> UniGene <> HomoloGene

Possible mis-referencing:
-- Genomic GenBank Acc.#’s
-- Referenced ID has more NT’s than probe
-- Old DB builds
-- DB or table errors

Page 55:

Finding genes in eukaryotic DNA

ORF identification – Three Letter Genetic Code (codons): 4*4*4 = 64 possible codons. It is possible to translate any stretch of genomic DNA into protein, but that doesn’t mean we have identified a protein coding gene!

There are several kinds of exons:
-- non-coding
-- initial coding exons
-- internal exons
-- terminal exons
-- some single-exon genes are intronless
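
The ORF point above can be made concrete with a small Python sketch. It is a naive scan written for illustration, not a gene finder: the function name, the `min_codons` parameter, and the test sequence are made up, only the forward strand is searched, and introns are ignored entirely.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}
ALL_CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # 4*4*4 = 64 codons


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:  # length counts the stop codon
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(len(ALL_CODONS))             # 64
print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 2)]
```

Three of the 64 codons are stops, so stretches without a stop turn up by chance in any sequence. A scan like this therefore reports many candidates that are not genes, and it misses any coding sequence split across exons, which is why ORF detection alone does not identify a protein coding gene.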

Page 56:

What We Are Going To Cover

Cells, Genes, Transcripts –> Genomics Experiments

Sequence Knowledge Behind Genomics Experiments

Annotation of Genes in Genomics Experiments