![Page 1: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/1.jpg)
Methods and Algorithms
for Gene Prediction
Chaochun Wei 韦朝春 Sc.D.
http://cbb.sjtu.edu.cn/~ccwei
Shanghai Jiao Tong University
Shanghai Center for Bioinformation Technology
5/12/2011 K-J-C Bioinformatics Training Course
![Page 2: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/2.jpg)
Outline
1. Introduction
2. Gene prediction methods
HMM
TWINSCAN and N-SCAN
Using ESTs for gene prediction
Resources
Latest progress
3. Gene Prediction FAQs
![Page 3: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/3.jpg)
1. Introduction: DNA
DNA contains genetic information.
DNA can be expressed as a sequence of letters A, C, G and T.
E.g.: ACGTTTCGAGGT
![Page 4: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/4.jpg)
DNA → RNA → Protein
[Figure: the flow from DNA to RNA (& processing) to protein]
![Page 5: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/5.jpg)
RNA Processing
[Figure: RNA processing of the primary mRNA; the 5’ end is labeled]
![Page 6: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/6.jpg)
Introduction: Gene Structure
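[Figure: gene structure from 5’ to 3’: 5’UTR, initial exon, intron, internal exon, intron, terminal exon, 3’UTR. The coding region starts at ATG and ends at a stop codon (TAA, TAG or TGA); each intron starts with GT and ends with AG.]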
A gene is a highly structured region of DNA; it is a functional unit of inheritance.
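The signals named on this slide (ATG start, TAA/TAG/TGA stop, GT...AG introns) can be checked mechanically on a candidate gene model. The sketch below is only an illustration: the function name `check_gene_model`, the 0-based half-open coordinates, and the toy sequence are assumptions made here, and it handles forward-strand coding exons only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(dna, exons):
    """Return a list of problems found in a forward-strand coding gene model.

    dna   -- genomic sequence (string over A, C, G, T)
    exons -- coding exons as sorted (start, end) pairs, 0-based, end exclusive
    """
    dna = dna.upper()
    problems = []
    # Splice the coding exons together to get the coding sequence (CDS).
    cds = "".join(dna[start:end] for start, end in exons)
    if not cds.startswith("ATG"):
        problems.append("CDS does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("CDS does not end with TAA, TAG or TGA")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    # Each intron lies between two consecutive exons.
    for (_, donor), (acceptor, _) in zip(exons, exons[1:]):
        intron = dna[donor:acceptor]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"intron {donor}-{acceptor} is not GT...AG")
    return problems


# Toy example (hypothetical data): exon "ATGGCC", intron "GTCCAG", exon "AAATAA".
toy_dna = "ATGGCC" + "GTCCAG" + "AAATAA"
toy_exons = [(0, 6), (12, 18)]
print(check_gene_model(toy_dna, toy_exons))  # prints []
```

The check is deliberately shallow: it does not look for in-frame stop codons, and real genes also use non-canonical splice sites, so a non-empty result flags a model for inspection rather than proving it wrong.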
![Page 7: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/7.jpg)
Josep F. Abril et al. Genome Res. 2005; 15: 111-119
Patterns in Splice Sites
Donor Sites Acceptor Sites
Sequence data from RefSeq of human, mouse, rat and chicken.
![Page 8: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/8.jpg)
A Typical Human Gene Structure
![Page 9: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/9.jpg)
Genes in a Genome
![Page 10: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/10.jpg)
In a Mammalian Genome
Finding all the genes is hard
Mammalian genomes are large
8,000 km of 10pt type
Only about 1% protein coding
![Page 11: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/11.jpg)
DNA, mRNA, cDNA and EST
EST: Short (~650 bp)
High error rate (~1-5%)
Contains only UTRs or coding regions
[Figure: DNA (UTR, coding region, UTR) is transcribed into primary mRNA, which RNA processing turns into mRNA; reverse transcription gives cDNA. A full-length cDNA and a 5’-truncated cDNA are each shown with a 5’ EST and a 3’ EST.]
![Page 12: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/12.jpg)
The Challenge and Opportunity
~3000 genomes
222 animals
93 plants
Better Gene Structure Annotation
Numbers from http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
[Figure: bar chart of genome numbers, axis from 0 to 3000]
![Page 13: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/13.jpg)
2. Gene Prediction Method History
| Generation | Date | Feature | Systems | Information or methods used |
| --- | --- | --- | --- | --- |
| 1st | Early 1980s | Approximate ends of protein coding regions and non-coding regions | TestCode (Fickett 1982); GRAIL (Uberbacher and Mural 1991) | Splice sites; promoters; codon usage bias; neural-network methods |
| 2nd | Early 1990s | A complete single gene in a short sequence | GeneID (Guigo et al. 1992); GeneParser (Snyder and Stormo 1993); FGENESH (Solovyev et al. 1994) | + Translation start sites; stop sites; Method: HMM |
| 3rd | Mid-1990s | Multiple complete or partial genes in a long sequence | Genscan (Burge and Karlin 1997) | UTR; Method: generalized HMM |
| 4th | 2000s | Complete gene structures in whole genomes | Twinscan (Korf et al. 2001); N-SCAN (Brown et al. 2006); Twinscan_EST, N-SCAN_EST (Wei and Brent 2006) | Multiple genomes; transcript products; Method: generalized HMM, Bayesian approach |
![Page 14: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/14.jpg)
Gene Prediction Methods (1)
Categorization: by input information
1. Ab initio methods
Only need genomic sequences as input
GENSCAN (Burge 1997; Burge and Karlin 1997)
GeneFinder (Green, unpublished)
Fgenesh (Solovyev and Salamov 1997)
Can predict novel genes
2. Transcript-alignment-based methods
Use cDNA, mRNA or protein similarity as major clues
ENSEMBL (Birney, Clamp, et al. 2004)
Highly accurate
Can only find genes with transcript evidence
cDNA coverage 50-60%
+ EST coverage up to 85%
![Page 15: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/15.jpg)
Gene Prediction Methods (2)
Categorization: by input information
3. Hybrid Methods
Integrate cDNA, mRNA, protein and EST alignments
into ab initio methods
Genie (Kulp, Haussler et al. 1996)
Fgenesh+ (Solovyev and Salamov 1997)
Genomescan (Yeh, Lim et al. 2001)
GAZE (Howe, Chothia et al. 2002)
AUGUSTUS+ (Stanke, Schoffmann et al. 2006)
![Page 16: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/16.jpg)
Gene Prediction Methods (3)
Comparative-Genomics-Based Methods
TWINSCAN and N-SCAN
De novo
Assumption:
Coding regions are more conserved.
No transcript similarity information (like EST, cDNA,
mRNA, or protein) is used
TWINSCAN_EST and N-SCAN_EST
Hybrid
Use EST to improve prediction accuracy
![Page 17: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/17.jpg)
TWINSCAN: A Novel Gene Prediction
System Using Dual Genomes
Conservation sequences represent the
conservation patterns between two genomes
No transcript similarity information (like EST,
cDNA, mRNA, and protein) is used
![Page 18: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/18.jpg)
Hidden Markov Model:
Model behind gene predictors
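$$\pi^{*} = \arg\max_{\pi} P(x, \pi)$$
HMM for flipping two biased coins (states 1 and 2)
Emission probabilities: $e_1(H) = 0.8,\ e_1(T) = 0.2,\ e_2(H) = 0.3,\ e_2(T) = 0.7$
[Figure: state diagram with states 1 and 2 and an End state; transition probabilities 0.9, 0.1, 0.7, 0.2, and 0.1 into End]
Observed sequence x: TTHHTTHTTTTTHTHHHHHTHTH
Hidden state sequence π: 11221111122221111112222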
![Page 19: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/19.jpg)
Most Probable Path and Viterbi
Algorithm
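Let
$$f_l(i) = \max_{\{\pi_0,\dots,\pi_{i-1}\}} \Pr(x_0,\dots,x_i,\ \pi_0,\dots,\pi_{i-1},\ \pi_i = l)$$
Recursion ($i = 1 \dots L$)
$$f_l(i) = e_l(x_i)\,\max_k \big(f_k(i-1)\,a_{kl}\big);$$
$$\mathrm{ptr}_i(l) = \arg\max_k \big(f_k(i-1)\,a_{kl}\big).$$
Time complexity $O(N^2 L)$, space complexity $O(NL)$
[Figure: trellis of states (1, 2) against positions 0, 1, 2, …, i-1, i, …, L-1, L; each cell takes the max over its incoming paths]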
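The recursion and traceback can be sketched in Python for the two-coin HMM. Several things here are assumptions rather than facts from the slides: the mapping of the diagram's transition probabilities onto arrows (1→1 = 0.9, 1→2 = 0.1, 2→2 = 0.7, 2→1 = 0.2), the uniform start distribution, and ignoring the End transition. Log-probabilities are used to avoid underflow on long sequences, and positions are 0-based.

```python
import math

STATES = (1, 2)
# Emission probabilities from the coin example.
EMIT = {1: {"H": 0.8, "T": 0.2}, 2: {"H": 0.3, "T": 0.7}}
# Assumed reading of the state diagram; the 0.1 transition to End is ignored.
TRANS = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.7}}
# Assumption: both coins are equally likely at the start.
START = {1: 0.5, 2: 0.5}


def viterbi(x):
    """Return (most probable state path, log-probability of that path)."""
    # f[i][l]: best log-probability of a path that emits x[0..i] and ends in l.
    f = [{l: math.log(START[l]) + math.log(EMIT[l][x[0]]) for l in STATES}]
    ptr = [{}]  # ptr[i][l]: best predecessor state of l at position i
    for i in range(1, len(x)):
        f.append({})
        ptr.append({})
        for l in STATES:
            best_k = max(STATES, key=lambda k: f[i - 1][k] + math.log(TRANS[k][l]))
            ptr[i][l] = best_k
            f[i][l] = (math.log(EMIT[l][x[i]])
                       + f[i - 1][best_k] + math.log(TRANS[best_k][l]))
    # Traceback from the best final state.
    last = max(STATES, key=lambda k: f[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    path.reverse()
    return path, f[-1][last]


path, logp = viterbi("TTHHTTHTTTTTHTHHHHHTHTH")
print("".join(str(s) for s in path), logp)
```

The two nested loops over states inside the loop over positions give the $O(N^2 L)$ running time, and the `f` and `ptr` tables give the $O(NL)$ space.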
![Page 20: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/20.jpg)
Probability of All the Possible Paths
and Forward Algorithm
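Let
$$f_l(i) = \Pr(x_0,\dots,x_i,\ \pi_i = l)$$
Recursion ($i = 1 \dots L$)
$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\,a_{kl}$$
Probability of all the possible paths
$$P(x) = \sum_{\pi} P(x, \pi) = \sum_k f_k(L)$$
[Figure: trellis of states (1, 2) against positions 0, 1, 2, …, i-1, i, …, L-1, L; each cell sums (+) over its incoming paths]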
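A minimal Python sketch of the forward recursion on the two-coin HMM follows. The transition assignment (1→1 = 0.9, 1→2 = 0.1, 2→2 = 0.7, 2→1 = 0.2), the uniform start distribution, and the omission of the End transition are assumptions made for illustration. Positions are 0-based, so the start distribution plays the role of the transitions out of the begin state.

```python
STATES = (1, 2)
# Emission probabilities from the coin example.
EMIT = {1: {"H": 0.8, "T": 0.2}, 2: {"H": 0.3, "T": 0.7}}
# Assumed reading of the state diagram; the 0.1 transition to End is ignored.
TRANS = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.7}}
# Assumption: both coins are equally likely at the start.
START = {1: 0.5, 2: 0.5}


def forward(x):
    """Return (f, P(x)), where f[i][l] = Pr(x[0..i], state at position i is l)."""
    f = [{l: START[l] * EMIT[l][x[0]] for l in STATES}]
    for i in range(1, len(x)):
        f.append({
            # Same shape as the Viterbi recursion, with max replaced by a sum.
            l: EMIT[l][x[i]] * sum(f[i - 1][k] * TRANS[k][l] for k in STATES)
            for l in STATES
        })
    # P(x) is the sum of the last column over all states.
    return f, sum(f[-1].values())


f, px = forward("TTHHTTHTTTTTHTHHHHHTHTH")
print(px)
```

Plain probabilities are fine for a 23-symbol toy sequence; on genomic sequences the products underflow, so practical implementations work in log space or rescale each column.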
![Page 21: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/21.jpg)
Posterior Probability and Forward
and Backward Algorithm
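Posterior Probability
$$P(\pi_i = k \mid x) = \frac{P(\pi_i = k,\ x)}{P(x)}$$
[Figure: trellis of states (1, 2) against positions 0, 1, 2, …, i-1, i, …, L-1, L; each cell sums (+) over its incoming paths]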
![Page 22: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/22.jpg)
Posterior Probability and Forward
and Backward Algorithm
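Let
$$b_k(i) = \Pr(x_{i+1},\dots,x_L \mid \pi_i = k)$$
Recursion ($i = L-1,\dots,1$)
$$b_k(i) = \sum_l a_{kl}\,e_l(x_{i+1})\,b_l(i+1)$$
$$P(x) = \sum_{\pi} P(x, \pi) = \sum_l a_{0l}\,e_l(x_1)\,b_l(1)$$
Posterior Probability
$$P(\pi_i = k \mid x) = \frac{P(\pi_i = k,\ x)}{P(x)} = \frac{f_k(i)\,b_k(i)}{P(x)}$$
[Figure: trellis of states (1, 2) against positions 0, 1, 2, …, i-1, i, …, L-1, L; each cell sums (+) over its outgoing paths]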
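Combining the forward and backward tables gives the posterior state probabilities. The sketch below uses the two-coin HMM under assumptions made here for illustration: the transition assignment (1→1 = 0.9, 1→2 = 0.1, 2→2 = 0.7, 2→1 = 0.2), a uniform start distribution, no End transition, and 0-based positions with the backward table initialised to 1 in the last column.

```python
STATES = (1, 2)
# Emission probabilities from the coin example.
EMIT = {1: {"H": 0.8, "T": 0.2}, 2: {"H": 0.3, "T": 0.7}}
# Assumed reading of the state diagram; the 0.1 transition to End is ignored.
TRANS = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.7}}
# Assumption: both coins are equally likely at the start.
START = {1: 0.5, 2: 0.5}


def forward(x):
    """f[i][l] = Pr(x[0..i], state at position i is l)."""
    f = [{l: START[l] * EMIT[l][x[0]] for l in STATES}]
    for i in range(1, len(x)):
        f.append({
            l: EMIT[l][x[i]] * sum(f[i - 1][k] * TRANS[k][l] for k in STATES)
            for l in STATES
        })
    return f


def backward(x):
    """b[i][k] = Pr(x[i+1..] | state at position i is k)."""
    n = len(x)
    b = [dict() for _ in range(n)]
    b[n - 1] = {k: 1.0 for k in STATES}
    for i in range(n - 2, -1, -1):
        b[i] = {
            k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] for l in STATES)
            for k in STATES
        }
    return b


def posterior(x):
    """post[i][k] = Pr(state at position i is k | x) = f_k(i) * b_k(i) / P(x)."""
    f, b = forward(x), backward(x)
    px = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(len(x))]


for i, p in enumerate(posterior("TTHHTTHTTT")):
    print(i, round(p[1], 3), round(p[2], 3))
```

Unlike the single best path from Viterbi, these per-position probabilities show how confident the model is about each position.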
![Page 23: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/23.jpg)
TWINSCAN Model
Generalized HMM
Each feature in a gene structure corresponds to one state.
State-specific length models.
State-specific sequence models.
Use conservation information.
![Page 24: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/24.jpg)
Conservation Sequence
Pair each nucleotide of the target with
“|” if it is aligned and identical
“:” if it is aligned to a mismatch
“.” if it is unaligned
Generated by projecting local alignments to the target sequence
human: CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
cons:  ||||||.........|:|||||||||::||:||...||:
mouse: CTAGAG---------AGACAGGTACCATAGGGCTCTCCT
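The three pairing rules can be sketched as a small Python function over one gapped pairwise alignment. The function name and the gap handling are assumptions made here: columns where the target has a gap are skipped (there is no target nucleotide to annotate), and target bases opposite an informant gap are treated as unaligned, which is how the human/mouse example above reads. TWINSCAN itself projects many local alignments, which this sketch does not attempt.

```python
def conservation_sequence(target_aln, informant_aln):
    """Return one conservation symbol per target nucleotide.

    target_aln, informant_aln -- the two rows of a pairwise alignment,
    equal length, with '-' marking gaps.
    """
    if len(target_aln) != len(informant_aln):
        raise ValueError("alignment rows must have the same length")
    symbols = []
    for t, q in zip(target_aln.upper(), informant_aln.upper()):
        if t == "-":
            continue             # no target nucleotide in this column
        if q == "-":
            symbols.append(".")  # assumption: informant gap counts as unaligned
        elif t == q:
            symbols.append("|")  # aligned and identical
        else:
            symbols.append(":")  # aligned to a mismatch
    return "".join(symbols)


human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
mouse = "CTAGAG---------AGACAGGTACCATAGGGCTCTCCT"
print(conservation_sequence(human, mouse))
```

The result has exactly one symbol per human nucleotide (36 here), so it can be read by the gene model in parallel with the target DNA sequence.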
![Page 25: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/25.jpg)
N-SCAN: A Novel Gene Prediction
System Using Multiple Genomes
Uses a Bayesian model to include phylogenetic tree information
Predicts introns in 5’UTR
Has conserved non-coding regions
(Brown, Gross and Brent, Genome Res. 2005)
![Page 26: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/26.jpg)
Using ESTs for Gene Prediction:
TWINSCAN_EST
Integrate EST alignment information into TWINSCAN to improve its accuracy where EST evidence exists, without compromising its ability to predict novel genes.
![Page 27: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/27.jpg)
Sequence Representation of EST
Alignments
1. Use EST-to-genome alignment programs
• BLAT (Kent 2002)
2. Project the top alignment for each EST to the target
genomic sequence
[Figure: EST alignments projected onto the target sequence to give the ESTseq, e.g. EEE III NNN NNNNNN NNNEEE EEEIII]
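The projection step can be sketched as follows, reading the symbols in the figure as E for a position inside an aligned EST block, I for a position in a gap between blocks of the same EST (a putative intron), and N for no EST evidence. That reading, the input format (blocks as 0-based half-open intervals on the target, as an aligner such as BLAT could supply after parsing), and the rule of writing N where ESTs disagree are all assumptions of this sketch rather than details given on the slide.

```python
def est_sequence(target_length, alignments):
    """Build an ESTseq string over the symbols E, I and N.

    alignments -- one entry per EST: a list of (start, end) aligned blocks
                  on the target, 0-based, end exclusive.
    """
    evidence = [set() for _ in range(target_length)]
    for blocks in alignments:
        blocks = sorted(blocks)
        # Positions covered by an aligned block support an exon.
        for start, end in blocks:
            for pos in range(start, end):
                evidence[pos].add("E")
        # Positions between consecutive blocks of one EST support an intron.
        for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
            for pos in range(prev_end, next_start):
                evidence[pos].add("I")
    # Assumption: conflicting or missing evidence is written as N.
    return "".join(next(iter(s)) if len(s) == 1 else "N" for s in evidence)


# One hypothetical EST with two blocks on a 12-base target.
print(est_sequence(12, [[(0, 3), (6, 9)]]))  # prints EEEIIIEEENNN
```

Like the conservation sequence, the ESTseq has one symbol per target base, which is what lets it be scored by state-specific models alongside the DNA.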
![Page 28: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/28.jpg)
Accuracy Measurement
Annotated data sets for training/testing
RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/)
CCDS (http://www.ncbi.nlm.nih.gov/CCDS/)
Accuracy at different levels
Nucleotide level
Exon level
Gene level
Transcript level
Sensitivity and specificity
![Page 29: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/29.jpg)
Accuracy Measurement (continued)
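$$\text{Sensitivity} = \frac{\text{Correct\_Prediction}}{\text{Total\_Annotation}}$$
$$\text{Specificity} = \frac{\text{Correct\_Prediction}}{\text{Total\_Prediction}}$$
[Figure: overlapping sets labeled Annotation and Prediction; the overlap is the Correct Prediction]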
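At the exon level, the two ratios can be computed by treating annotated and predicted exons as sets and counting exact matches. The function name and the coordinates below are hypothetical, chosen only for illustration. Note that "specificity" as defined here is the fraction of predictions that are correct, which other fields call precision.

```python
def exon_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for exact exon matches.

    annotated, predicted -- iterables of (start, end) exon coordinates.
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted  # exons with both boundaries exactly right
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical coordinates: 4 annotated exons, 3 predicted, 2 exactly right.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (510, 600)]
sn, sp = exon_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

The third predicted exon misses one boundary by ten bases and therefore counts as wrong, which is why exact-exon and exact-gene figures are much stricter than nucleotide-level ones.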
![Page 30: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/30.jpg)
TWINSCAN_EST and N-SCAN_EST
on the Whole Human Genome
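[Figure: bar chart, percent (0 to 100) for each program and accuracy measure]

| Program | Exact gene sensitivity | Exact gene specificity | Exact exon sensitivity | Exact exon specificity |
| --- | --- | --- | --- | --- |
| TWINSCAN 2.03 | 24 | 14 | 67 | 56 |
| TWINSCAN_EST | 34 | 17 | 81 | 58 |
| N-SCAN | 38 | 22 | 85 | 59 |
| N-SCAN_EST | 44 | 23 | 88 | 60 |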
![Page 31: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/31.jpg)
An Example of N-SCAN_EST
Prediction
(Hg17, chr21:33,459,500-33,465,411)
![Page 32: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/32.jpg)
An Example of N-SCAN_EST Prediction
![Page 33: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/33.jpg)
Experimental Validation of Predictions
Siepel, Genome Research, 2007
![Page 34: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/34.jpg)
Experimental Validation of Predictions
See
The MGC Project Team, “The Completion of the
Mammalian Gene Collection (MGC)”, Genome Research,
2009, 19:2324-2333
Wei, C., et al., “Closing in on the C. elegans ORFeome by
Cloning TWINSCAN predictions”, Genome Research,
2005, 15:577-582.
Tenney, A. E. et al., “Gene prediction and verification in a
compact genome with numerous small introns”, Genome
Research, 2004
![Page 35: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/35.jpg)
N-SCAN/TWINSCAN Webserver:
http://mblab.wustl.edu/nscan/submit
![Page 36: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/36.jpg)
Resources for Gene Prediction
Sequence data sets
Nucleotide Sequences (NCBI)
dbEST
mRNA
cDNA
Annotations
RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/)
CCDS (http://www.ncbi.nlm.nih.gov/CCDS/)
Genome Browser
UCSC Genome Browser(http://genome.ucsc.edu/)
![Page 37: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/37.jpg)
Latest Progress in Gene Prediction
New Methods
Conrad: gene prediction using conditional random fields.
DeCaprio et al., Genome Res. 2007 Sep;17(9):1389-98.
Not working for vertebrate genomes
SVM for splice site prediction
Sonnenburg et al., BMC Bioinformatics. 2007;8 Suppl 10:S7.
CONTRAST: Gross et al., Genome Biology 2007, 8:R269
Best de novo gene predictor for human (gene-level accuracy ~50%)
Uses SVMs and a conditional random field
![Page 38: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/38.jpg)
3. Gene Prediction FAQs (from Ian Korf)
Algorithms vs. experts
Q: Are expert biologists better than computer programs?
A: Yes and no.
Next-generation sequencing
Q: Will next-gen transcript sequencing replace gene
prediction?
A: No. Rare transcripts may require directed experiments
to validate.
Prediction accuracy
Q: Why are gene prediction programs inaccurate?
A: We don’t always know why.
![Page 39: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/39.jpg)
Gene Prediction FAQs (continued)
Genes in my favorite genome…
Q: There is no gene predictor for it. What should I do?
![Page 40: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/40.jpg)
Gene prediction for algal genomes
Macroalgae
Microalgae
![Page 41: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/41.jpg)
A Typical Algal Gene Structure
A Typical Human Gene Structure
![Page 42: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/42.jpg)
Gene Prediction FAQs (continued)
Genes in my favorite genome…
Q: There is no gene predictor for it. What should I do?
A: Train a gene predictor, or use one built for a closely related organism, although it may be inaccurate.
Difficult genes
Q: Why are some genes not predicted by any program?
A: They are statistical outliers.
![Page 43: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/43.jpg)
Gene Prediction FAQs (continued)
Just coding exons…
Q: Why are other features not predicted, such as non-coding exons, alternative isoforms, non-canonical splice sites, and genes within genes?
A: There are trade-offs.
Pseudogenes
Q: Why do some gene predictions have tiny introns?
A: Retropseudogenes often have very strong coding signals, because they are derived from highly expressed genes.
![Page 44: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/44.jpg)
Gene Prediction FAQs (end)
How can I tell a good gene prediction from a
bad one?
Scores are assigned to every exon and intron of a predicted gene, and these scores indicate whether the prediction is good.
You may have to run the program on your own computer to obtain them!
![Page 45: Methods and Algorithms for Gene Prediction - CJK Bioinfo · Gene Prediction Methods (1) Categorization: by input information 1. Ab initio methods Only need genomic sequences as input](https://reader030.vdocument.in/reader030/viewer/2022040600/5e8a8de77a6291549d1e94e9/html5/thumbnails/45.jpg)
Acknowledgement
Training program organizers
Former lab members (WashU)
Dr. Michael Brent
Dr. Paul Flicek
Dr. Ian Korf
Samuel Gross