
Carlo Colantuoni &

Rafael Irizarry

April 19, 2006

ccolantu@jhsph.edu

Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor

Biological Setup

Nearly every cell in the human body contains the entire human genome: 3.3 Gb, or ~30K genes.

The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes.

Tasks necessary for gene expression analysis:

Define what a gene is.

Identify genes in a sea of genomic DNA where <3% of DNA is contained in genes.

Design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously. Cross-reference these probes.

Cellular Biology, Gene Expression, and Microarray Analysis

[Figure labels: DNA, RNA, Protein; genomic DNA (3.3 Gb, ~30K genes); mRNA with 5' UTR, START, protein coding, STOP, 3' UTR, and poly-A tail (AAAAA); DNA probe]

Gene: Protein-coding unit of genomic DNA with an mRNA intermediate.

Sequence is a Necessity

From Genomic DNA to mRNA Transcripts

EXONS, INTRONS

RNA editing & SNPs

Alternative splicing

Alternative start & stop sites in same RNA molecule

~30K

>30K

Transcript coverage

Homology to other transcripts

Hybridization dynamics

3' bias

Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.

Sequence Quality!

Redundancy!

Completeness?

Unsurpassed as source of expressed sequence

Chaos?!?

From Genomic DNA to mRNA Transcripts

~30K

>30K

>>30K

Transcript-Based Gene-Centered Information

Possible mis-referencing:

• Genomic GenBank Acc. #'s
• Referenced ID has more NT's than probe
• Old DB builds
• DB or table errors – copying and pasting 30K rows in Excel …

Using RefSeqs can help.

Design of Gene Expression Probes

Content: UniGene, Incyte, Celera

Expressed vs. Genomic

Source: cDNA libraries, clone collections, oligos

Cross-referencing of array probes (across platforms):

Sequence <> GenBank <> UniGene <> HomoloGene
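The cross-referencing chain above can be sketched as a series of table lookups. This is a minimal illustration, not how any particular database or BioConductor package does it: the table names, the `cross_reference` helper, and every ID are made-up placeholders. The point is that a probe can drop out at any link, so it helps to record where the chain breaks rather than silently losing the probe.

```python
# Hypothetical lookup tables; every ID below is a made-up placeholder,
# not a real GenBank, UniGene, or HomoloGene identifier.
PROBE_TO_GENBANK = {"probe_1": "ACC0001", "probe_2": "ACC0002", "probe_3": "ACC0003"}
GENBANK_TO_UNIGENE = {"ACC0001": "Hs.1001", "ACC0002": "Hs.1002"}
UNIGENE_TO_HOMOLOGENE = {"Hs.1001": "HG:501"}

# Order of the chain: probe -> GenBank -> UniGene -> HomoloGene.
CHAIN = [
    ("GenBank", PROBE_TO_GENBANK),
    ("UniGene", GENBANK_TO_UNIGENE),
    ("HomoloGene", UNIGENE_TO_HOMOLOGENE),
]


def cross_reference(probe_id):
    """Follow the chain for one probe; a missing link yields None from there on."""
    result = {"probe": probe_id}
    current = probe_id
    for name, table in CHAIN:
        current = table.get(current) if current is not None else None
        result[name] = current
    return result


if __name__ == "__main__":
    for probe in ("probe_1", "probe_2", "probe_3"):
        print(cross_reference(probe))
```

Keeping `None` in the result makes it easy to count how many probes are lost at each step, which is one way to notice an old database build or a table error.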

From Genomic DNA to mRNA Transcripts


http://www.ncbi.nlm.nih.gov/Entrez/

Functional Annotation of Lists of Genes

KEGG

PFAM

SWISS-PROT

GO

DRAGON

DAVID

BioConductor

Analysis of Functional Gene Groups

• One of the largest challenges in analyzing genomic data is associating the experimental data with the available metadata, e.g. sequence, gene annotation, chromosomal maps, literature.

• The annotate and AnnBuilder packages provide some tools for carrying this out.

• Using AnnBuilder, it is possible to build associations with specific gene lists, e.g. the hgu95a package for Affymetrix HGU95A GeneChips.

• The annotate package maps to GenBank accession number, LocusLink LocusID, gene symbol, gene name, UniGene cluster, chromosome, cytoband, physical distance (bp), orientation, Gene Ontology Consortium (GO), PubMed PMID.


Functional Gene/Protein Networks

DIP, BIND, MINT, HPRD

PubGene

Predicted Protein Interactions

Analysis of Gene Networks

9606 is the Taxonomy ID for Homo sapiens

Predicted Human Protein Interactions


Used high-throughput protein interaction experiments from fly, worm, and yeast to predict human protein interactions.

A human protein interaction is predicted if both proteins in an interaction pair from another organism have high sequence homology to human proteins.
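The prediction rule can be sketched in a few lines. This is only an illustration under stated assumptions: the `homologs` table is assumed to be precomputed (the homology threshold is not given here), and the function name and all protein names are hypothetical.

```python
def predict_human_interactions(model_interactions, homologs):
    """Transfer interactions from a model organism to human.

    model_interactions: iterable of (protein_a, protein_b) pairs observed
        in fly, worm, or yeast.
    homologs: dict mapping a model-organism protein to the set of human
        proteins judged highly homologous to it (assumed precomputed).
    """
    predicted = set()
    for a, b in model_interactions:
        # Both partners need at least one human homolog; otherwise
        # the inner loops do not run and nothing is predicted.
        for human_a in homologs.get(a, ()):
            for human_b in homologs.get(b, ()):
                # Sort so that (x, y) and (y, x) count as one interaction.
                predicted.add(tuple(sorted((human_a, human_b))))
    return predicted


if __name__ == "__main__":
    # Made-up example data.
    fly_interactions = [("flyA", "flyB"), ("flyB", "flyC")]
    homologs = {"flyA": {"HUMAN_1"}, "flyB": {"HUMAN_2", "HUMAN_3"}}
    print(sorted(predict_human_interactions(fly_interactions, homologs)))
```

In this toy input, `flyC` has no human homolog, so the `flyB`/`flyC` pair contributes no prediction.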

>70K Hs interactions predicted

>6K Hs genes


Carlo Colantuoni

Clinical Brain Disorders Branch, NIMH, NIH

Dept. Biostatistics, JHSPH

ccolantu@jhsph.edu

colantuc@intra.nimh.nih.gov

Thanks to …

Rafael Irizarry

Scott Zeger

Jonathan Pevsner

http://www.ncbi.nlm.nih.gov

http://www.ncbi.nlm.nih.gov/Entrez/

http://www.ncbi.nih.gov/Genbank/

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide

http://www.ncbi.nlm.nih.gov/dbEST/

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene

http://www.ncbi.nlm.nih.gov/LocusLink/

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

http://www.ncbi.nlm.nih.gov/PubMed/

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd

http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp

http://www.ncbi.nlm.nih.gov/SNP/

http://eutils.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html

http://www.ncbi.nlm.nih.gov/geo/

http://www.ncbi.nlm.nih.gov/RefSeq/

FTP:

ftp://ftp.ncbi.nlm.nih.gov/

ftp://ftp.ncbi.nlm.nih.gov/repository/UniGene

ftp://ftp.ncbi.nih.gov/pub/HomoloGene/

NCBI Web Links

NUCLEOTIDE:

http://genome.ucsc.edu/

http://www.embl-heidelberg.de/

http://www.ensembl.org/

http://www.ebi.ac.uk/

http://www.gdb.org/

http://bioinfo.weizmann.ac.il/cards/index.html

http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl

PATHWAYS and NETWORKS:

http://www.genome.ad.jp/kegg/

ftp://ftp.genome.ad.jp/pub/kegg/ (http://www.genome.ad.jp/anonftp/)

http://dip.doe-mbi.ucla.edu

http://dip.doe-mbi.ucla.edu/dip/Download.cgi

http://www.blueprint.org/bind/

http://www.blueprint.org/bind/bind_downloads.html

http://160.80.34.4/mint/index.php

http://160.80.34.4/mint/release/main.php

http://www.hprd.org/

http://www.hprd.org/FAQ?selectedtab=DOWNLOAD+REQUESTS

http://www.pubgene.org/ (also .com)

PROTEIN:

http://us.expasy.org/

ftp://us.expasy.org/

http://www.sanger.ac.uk/Software/Pfam/

http://www.sanger.ac.uk/Software/Pfam/ftp.shtml

http://smart.embl-heidelberg.de/

http://www.ebi.ac.uk/interpro/

http://us.expasy.org/prosite/

ftp://us.expasy.org/databases/prosite/

More Web Links

http://www.bioconductor.org/

http://apps1.niaid.nih.gov/david/

http://www.geneontology.org/

http://discover.nci.nih.gov/gominer/index.jsp

http://pubmatrix.grc.nia.nih.gov/

http://pevsnerlab.kennedykrieger.org/dragon.htm

SAVAGE:

Detection of More Subtle Functionally Related Groups of Gene Expression Changes

[Figure labels: EXP#1; Swiss-Prot, PFAM, KEGG; 30K, ~3K, 10K; ~40K annotations; DRAGON, SAVAGE]

Differential Expression of Functional Gene Groups within One Experiment

[Figure labels: EXP#1, EXP#2, EXP#3, EXP#4, BioDB]

Differential Expression of a Single Functional Gene Group Across Multiple Experiments

DRAGON

SAVAGE

Similar Differential Expression Patterns Across Multiple Experiments

[Figure: p value scale from 0.0 to <0.1; panel labels ALL and CN]

The distribution of gene expression values for each gene group in each sample is plotted as a single point in low-dimensional space. This is achieved using Principal Components Analysis along with Non-Metric Multi-Dimensional Scaling.
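The PCA step alone might look like the sketch below; the non-metric multi-dimensional scaling step is left out. It assumes each gene group's distribution in each sample has already been summarized as a fixed-length numeric vector (for example, a few quantiles). That summary, the function names, and the power-iteration approach are all illustrative choices, not a description of how SAVAGE is implemented.

```python
import math


def center(rows):
    """Subtract the column mean from every entry."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=500):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def pca_project(rows, k=2):
    """Project each row onto the first k principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        v = top_component(cov)
        # Eigenvalue estimate (Rayleigh quotient), then deflate so the
        # next pass finds the next component.
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        components.append(v)
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]


if __name__ == "__main__":
    # Four made-up summary vectors, one per (gene group, sample) pair.
    print(pca_project([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]))
```

Each output row is one point in the two-dimensional plot. Power iteration is used here only to keep the sketch free of external libraries; a numerical library's SVD would be the more usual choice.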

[Figure: plot labels EXP#1 and EXP#2; points numbered 1 to 5; X; CN]

PING:

Detection of Differential Expression in Functional Networks of Proteins

Interaction Networks in Gene Expression Data

Large Protein Interaction Network

Network Regulated in Sample #1

Network Regulated in Sample #2

Network Regulated in Sample #3

Network of Interest

PING

[Figure: NT's in GenBank (millions), axis ticks from 1 to 100000 in powers of ten, plotted against year (1984, 1994, 2004)]

Genomic DNA Content

1. Interspersed repeats (~1/2 Hs. genome)
2. (Processed) pseudogenes
3. Simple sequence repeats
4. Segmental duplications (~5% Hs. genome)
5. Blocks of tandem repeats (can be very large)
6. Genes: Promoters - Exons - Introns (<3%)

defining what a gene is - protein coding unit of genomic DNA with an mRNA intermediate

identifying genes within genomic DNA

protein-coding genes (mRNA)

functional RNA genes - tRNA, rRNA, snoRNA, snRNA, miRNA

prokaryotes eukaryotes


How is a gene defined in “wet” biology and in silico?

Seq. from mRNA sample

Seq. on array

Array probe design:

Source – cDNA libraries, oligos, clone collections

Content – UniGene, Celera, Incyte

Transcript coverage

Homology to other transcripts

Hybridization dynamics – hyper-multiplex hyb rxn

Empirical validation

3’ bias

Alt. splicing - known and not

Alt. start / stop site in same RNA molecule

Less important: RNA editing, SNPs

Cross-referencing of array probes: GenBank <> UniGene <> HomoloGene

Possible mis-referencing:

• Genomic GenBank Acc. #'s
• Referenced ID has more NT's than probe
• Old DB builds
• DB or table errors

Finding genes in eukaryotic DNA

ORF identification – Three-letter genetic code (codons): 4*4*4 = 64 possible codons. It is possible to translate any stretch of genomic DNA into protein, but that doesn't mean we have identified a protein-coding gene!
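A naive ORF scan makes the point concrete. The sketch below enumerates the 64 triplets and reports every ATG-to-stop stretch in the three forward reading frames. It assumes the standard start codon ATG and the standard stop codons TAA, TAG, and TGA, ignores the reverse strand, and knows nothing about introns; the function name and the `min_codons` parameter are illustrative.

```python
from itertools import product

BASES = "ACGT"
# All three-letter codons: 4 * 4 * 4 = 64 combinations.
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here runs from the first ATG after the previous in-frame stop
    through the next in-frame stop codon. `end` is exclusive and includes
    the stop codon. `min_codons` counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATAGGG"  # made-up sequence
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```

On random or non-coding DNA, a scan like this still reports short ORFs by chance, which is why an ORF alone does not identify a protein-coding gene.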

There are several kinds of exons:
-- non-coding
-- initial coding exons
-- internal exons
-- terminal exons
-- some single-exon genes are intronless

What We Are Going To Cover

Cells, Genes, Transcripts –> Genomics Experiments

Sequence Knowledge Behind Genomics Experiments

Annotation of Genes in Genomics Experiments
