Post on 21-Dec-2015

Page 1: Eukaryotic Gene Finding Adapted in part from

Eukaryotic Gene Finding

Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/

Prokaryotic vs. Eukaryotic Genes

Prokaryotes
• small genomes
• high gene density
• no introns (or splicing)
• no RNA processing
• similar promoters
• overlapping genes

Eukaryotes
• large genomes
• low gene density
• introns (splicing)
• RNA processing
• heterogeneous promoters
• polyadenylation

Pre-mRNA Splicing

[Figure: splicing signals and factors on a pre-mRNA. Signals: 5’ splice signal, branch signal, polyY tract, 3’ splice signal. Regulatory elements: exonic enhancers, exonic repressor, intronic enhancers, intronic repressor. Factors: U1 snRNP, U2 snRNP, U2AF65, U2AF35, SR proteins. Panels: exon definition and intron definition, followed by assembly of the spliceosome and catalysis.]

Some Statistics

• On average, a vertebrate gene is about 30 kb long
• The coding region takes up about 1 kb
• Exon sizes range from tens of base pairs to kilobases
• An average 5’ UTR is about 750 bp
• An average 3’ UTR is about 450 bp, but both can be much longer

Human Splice Signal Motifs

[Figure: sequence motifs of the 5' splice signal and the 3' splice signal]

Semi-Markov HMM Model

Genscan HSMM

GenScan States

• N - intergenic region
• P - promoter
• F - 5’ untranslated region
• Esngl - single exon (intronless) (translation start -> stop codon)
• Einit - initial exon (translation start -> donor splice site)
• Ek - phase k internal exon (acceptor splice site -> donor splice site)
• Eterm - terminal exon (acceptor splice site -> stop codon)
• Ik - phase k intron: 0 - between codons; 1 - after the first base of a codon; 2 - after the second base of a codon
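The intron phase definition above can be checked with a few lines of Python. This is a minimal sketch, not GenScan code: the function name `intron_phases` and the exon lengths are made up for illustration, and the lengths are assumed to count coding bases only.

```python
def intron_phases(exon_coding_lengths):
    """Return the phase (0, 1 or 2) of the intron that follows each exon.

    The phase is the number of coding bases seen so far, modulo 3:
    0 = intron falls between codons, 1 = after the first base of a
    codon, 2 = after the second base. The last exon has no intron
    after it, so it contributes no phase.
    """
    phases = []
    coding_so_far = 0
    for length in exon_coding_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


# Hypothetical four-exon gene (coding lengths in bp, total 210 = 70 codons).
print(intron_phases([100, 50, 31, 29]))  # prints [1, 0, 1]
```

An internal exon that follows a phase k intron is what the state list calls a phase k exon, so the same running count also says which Ek state each internal exon belongs to.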

GenScan features

• Model both strands at once
• Each state may output a string of symbols (according to some probability distribution)
• Explicit intron/exon length modeling
• Advanced splice site modeling
• Parameters learned from annotated genes
• Separate parameter training for different C+G content groups
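To make "each state outputs a string" and "explicit length modeling" concrete, here is a rough sketch of how a semi-Markov model could score one candidate parse: every segment contributes a transition term, a length term, and an emission term. All states, probabilities, and names (`TRANSITIONS`, `LENGTH_PROB`, `BASE_PROB`, `parse_log_prob`) are toy assumptions for illustration and are not GenScan's trained parameters or state set.

```python
import math

# Hypothetical toy parameters, chosen for illustration only.
TRANSITIONS = {
    ("start", "exon"): 0.5,
    ("start", "intron"): 0.5,
    ("exon", "intron"): 1.0,
    ("intron", "exon"): 1.0,
}


def exon_length_prob(d):
    # Explicit (non-geometric) length distribution, as a lookup table.
    return {3: 0.2, 6: 0.5, 9: 0.3}.get(d, 0.0)


def intron_length_prob(d, p=0.1):
    # Geometric length distribution, as an ordinary HMM would imply.
    return (1 - p) ** (d - 1) * p if d >= 1 else 0.0


LENGTH_PROB = {"exon": exon_length_prob, "intron": intron_length_prob}

# Independent per-base emission probabilities for each state.
BASE_PROB = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def parse_log_prob(parse):
    """Log-probability of a parse given as [(state, segment_string), ...]."""
    logp = 0.0
    prev = "start"
    for state, segment in parse:
        p_trans = TRANSITIONS.get((prev, state), 0.0)
        p_len = LENGTH_PROB[state](len(segment))
        if p_trans == 0.0 or p_len == 0.0:
            return float("-inf")  # impossible parse under this toy model
        logp += math.log(p_trans) + math.log(p_len)
        logp += sum(math.log(BASE_PROB[state][b]) for b in segment)
        prev = state
    return logp


parse = [("exon", "ATGGCC"), ("intron", "GTAAGT"), ("exon", "GCC")]
print(parse_log_prob(parse))
```

A gene finder built this way would search over all parses for the highest-scoring one; the sketch only scores a parse that is already given.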

GenScan Signal Modeling

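• PSSM (position-specific scoring matrix): $P(S) = P_1(S_1) \cdot P_2(S_2) \cdots P_n(S_n)$
  – PolyA signal
  – Translation initiation/termination signal
  – Promoters
• WAM (weight array model): $P(S) = P_1(S_1) \cdot P_2(S_2 \mid S_1) \cdots P_n(S_n \mid S_{n-1})$
  – 5’ and 3’ splice sites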
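The difference between the two signal models is easiest to see side by side. The sketch below estimates both from a handful of aligned sites and scores a candidate; the training sites, the pseudocount of 1, and all function names are assumptions made for illustration, not GenScan's data or code.

```python
import math
from collections import Counter

BASES = "ACGT"


def train_pssm(sites, pseudo=1.0):
    """One independent base distribution per position: P_i(S_i)."""
    model = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        total = len(sites) + 4 * pseudo
        model.append({b: (counts[b] + pseudo) / total for b in BASES})
    return model


def train_wam(sites, pseudo=1.0):
    """First position as in a PSSM, then P_i(S_i | S_{i-1}) per position."""
    first = train_pssm([s[0] for s in sites], pseudo)[0]
    cond = []
    for i in range(1, len(sites[0])):
        table = {}
        for prev in BASES:
            counts = Counter(s[i] for s in sites if s[i - 1] == prev)
            total = sum(counts.values()) + 4 * pseudo
            table[prev] = {b: (counts[b] + pseudo) / total for b in BASES}
        cond.append(table)
    return first, cond


def pssm_log_prob(model, s):
    return sum(math.log(model[i][b]) for i, b in enumerate(s))


def wam_log_prob(wam, s):
    first, cond = wam
    logp = math.log(first[s[0]])
    for i in range(1, len(s)):
        logp += math.log(cond[i - 1][s[i - 1]][s[i]])
    return logp


# Hypothetical aligned donor-like sites, for illustration only.
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pssm = train_pssm(sites)
wam = train_wam(sites)
for candidate in ("GTAAGT", "CCCCCC"):
    print(candidate, pssm_log_prob(pssm, candidate), wam_log_prob(wam, candidate))
```

The WAM needs a 4x4 table per position instead of 4 numbers, so it needs more training sites; that is the usual trade-off when adding adjacent-position dependencies.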

HMM-based Gene Finding

GENSCAN (Burge 1997)

FGENESH (Solovyev 1997)

HMMgene (Krogh 1997)

GENIE (Kulp 1996)

GENMARK (Borodovsky & McIninch 1993)

VEIL (Henderson, Salzberg, & Fasman 1997)

GenomeScan

• Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan)

• Focus on ‘typical case’ when homologous but not identical proteins are available.

• Idea: We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.

GeneWise [Birney, Amitai]

• Motivation: Use a good database of protein domains (Pfam) to help annotate genomic DNA

• GeneWise algorithm aligns a profile HMM directly to the DNA

Sample GeneWise Output

Developing GeneWise Model

• Start with a PFAM domain HMM

• Replace AA emissions with codon emissions

$P(\mathrm{codon} \mid M_i) = P(\mathrm{codon} \mid \mathrm{aa}) \, P(\mathrm{aa} \mid M_i)$

• Allow for sequencing errors (deletions/insertions)
• Add a 3-state intron model
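The codon-emission step can be sketched as follows: each match state's amino acid distribution is spread over the codons that encode each amino acid. The sketch assumes the standard genetic code and, as a simplifying assumption, uniform usage of synonymous codons when no codon-usage table is supplied; `codon_emissions` and the example match state are hypothetical, not GeneWise's implementation.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def codon_emissions(aa_emissions, codon_given_aa=None):
    """P(codon | M_i) = P(codon | aa) * P(aa | M_i) for every sense codon.

    aa_emissions: dict amino acid -> P(aa | M_i) for one match state.
    codon_given_aa: optional dict codon -> P(codon | aa); if omitted,
    synonymous codons are assumed equally likely (an assumption, real
    models would use species-specific codon usage).
    """
    synonyms = Counter(CODON_TO_AA.values())
    out = {}
    for codon, aa in CODON_TO_AA.items():
        if aa == "*":
            continue  # stop codons are not emitted by a match state
        if codon_given_aa is None:
            p_codon = 1.0 / synonyms[aa]
        else:
            p_codon = codon_given_aa[codon]
        out[codon] = p_codon * aa_emissions.get(aa, 0.0)
    return out


# Hypothetical match state that prefers Met, then Leu, then Trp.
state = {"M": 0.5, "L": 0.3, "W": 0.2}
emissions = codon_emissions(state)
print(emissions["ATG"], emissions["CTG"], emissions["TGG"])
```

Because each codon encodes exactly one amino acid, no sum over amino acids is needed, and the codon probabilities of a state still add up to 1 whenever its amino acid probabilities do.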

GeneWise Model

GeneWise Intron Model

[Figure: intron model running from the 5’ site to the 3’ site through three states: central, PY tract, spacer]

GeneWise Model

• Viterbi algorithm -> “best” alignment of DNA to protein domain

• Alignment gives exact exon-intron boundaries

• Parameters learned from species-specific statistics
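For readers who have not seen it, the Viterbi recursion itself is short. The sketch below runs it on a toy two-state HMM (coding vs. noncoding) rather than on the GeneWise model, which has many more states; the state names and all probabilities are invented for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability)."""
    # scores[s] = best log-probability of any path ending in state s.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for x in obs[1:]:
        new_scores, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: scores[r] + math.log(trans_p[r][s]))
            new_scores[s] = (scores[best_prev] + math.log(trans_p[best_prev][s])
                             + math.log(emit_p[s][x]))
            ptr[s] = best_prev
        scores = new_scores
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, scores[last]


# Hypothetical toy model: coding regions G/C-rich, noncoding A/T-rich.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

path, logp = viterbi("GGGGGGAAAAAA", STATES, START, TRANS, EMIT)
print(path, logp)
```

In a homology-guided aligner the states would instead be the match, insert, delete, and intron states of the profile model, and the traced-back path is what yields the exon-intron boundaries.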

GeneWise problems

• Only provides a partial prediction, and only where the homology lies
  – Does not find “more” genes
• Pseudogenes and retrotransposons are picked up
• CPU intensive
  – Solution: Pre-filter with BLAST

Summary

• Genes are complex structures which are difficult to predict with the required level of accuracy/confidence

• Different approaches to gene finding:
  – Ab initio: GenScan
  – Ab initio modified by BLAST homologies: GenomeScan
  – Homology guided: GeneWise