GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences
Authors: Michael M. Yin and Jason T. L. Wang
Source: Information Sciences, 163(1-3), pp. 201-218, 2004.
Advisor: Min-Shiang Hwang
Speaker: Chun-Ta Li
Outline
• Introduction
• Related work
• The proposed approach
• Experiments and results
• Conclusion
• Comments
Introduction – 1/4
• Data mining – knowledge discovery from data
• Data mining in life sciences:
– Finding clustering rules for gene expressions
– Discovering classification rules for proteins
– Detecting associations between metabolic pathways
– Predicting genes in genomic DNA sequences
Introduction – 2/4
• A genomic DNA sequence
– Four types of nucleotides (A, C, G, T)
• The basic structure for a vertebrate gene
• A sequence fragment containing an exon of 296 nucleotides
Figure labels: codon; introns; exons (coding sequences); donor
Introduction – 3/4
[Figure: coding region]
Introduction – 4/4
• A number of programs have been developed for locating gene coding regions (exons).
• Why existing programs are insufficient:
– The vertebrate DNA sequence signals involved in gene determination are usually ill defined.
– Automated interpretation of genomic data without experimental validation is still a myth.
• Motivation:
– GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures.
– Exon: start sites, junction donor sites, acceptor sites
Related work – 1/2
• NN-based techniques (Neural Network)
– Gene structure prediction
– Training
Related work – 2/2
• HMM-based techniques (Hidden Markov Models)
– To describe sequential data or processes
– Using a number of states
– Probabilistic state transitions
– Example: casting a die, with two hidden states, Normal and Fake
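The die example can be made concrete with a small two-state HMM. The following is a minimal sketch, assuming the hidden states are a fair ("Normal") and a loaded ("Fake") die; every probability in it is an illustrative assumption, not a value from the slides, and `viterbi` is a hypothetical helper name. It recovers the most probable state sequence behind a series of rolls using the standard Viterbi algorithm in log space.

```python
import math

STATES = ("Normal", "Fake")

# All probabilities below are illustrative assumptions, not values from the slides.
START = {"Normal": 0.5, "Fake": 0.5}
TRANS = {
    "Normal": {"Normal": 0.95, "Fake": 0.05},
    "Fake": {"Normal": 0.10, "Fake": 0.90},
}
EMIT = {
    "Normal": {face: 1 / 6 for face in range(1, 7)},
    # The assumed loaded die shows a six half of the time.
    "Fake": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},
}


def viterbi(rolls):
    """Return the most probable hidden-state sequence for the observed rolls."""
    if not rolls:
        return []
    # score[s] = best log-probability of any state path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
            )
            pointer[s] = prev
        backpointers.append(pointer)
        score = new_score
    # Trace the best path back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path
```

With these assumed parameters, a long run of sixes is attributed to the Fake state, while a run without sixes is attributed to the Normal state.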
The proposed approach – 1/4
• HMM models for predicting functional sites
– Start Site Model
[Figure: Start Site Model, with the start codon marked]
The proposed approach – 2/4
• An HMM model for computing coding potentials
– The Codon Model
• First state is base T
• Second state is base A or G
• Third state can only be C or T (A and G are not defined)
Triplets excluded by this rule: TAA, TAG, TGA, TGG (TAA, TAG and TGA are the standard stop codons; TGG encodes tryptophan)
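The three-state rule can be written down as a small check on single triplets. This is a minimal sketch, assuming that triplets not starting with T followed by A or G are left unconstrained (the slide does not say); the function names are hypothetical, and `STANDARD_STOP_CODONS` holds the three stop codons of the standard genetic code. It is not the full Codon Model, which is an HMM with trained probabilities.

```python
STANDARD_STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def allowed_by_slide_rule(codon):
    """Apply the slide's three-state rule to one triplet.

    Triplets starting with T followed by A or G may only end in C or T;
    every other triplet is left unconstrained (an assumption of this sketch).
    """
    codon = codon.upper()
    if len(codon) != 3 or set(codon) - set("ACGT"):
        raise ValueError(f"not a DNA triplet: {codon!r}")
    if codon[0] == "T" and codon[1] in "AG":
        return codon[2] in "CT"
    return True


def first_in_frame_stop(sequence, frame=0):
    """Return the 0-based index of the first in-frame standard stop codon, or -1."""
    sequence = sequence.upper()
    for i in range(frame, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STANDARD_STOP_CODONS:
            return i
    return -1
```

A candidate coding region can then be rejected early when `first_in_frame_stop` finds a stop codon before the region's end.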
The proposed approach – 3/4
• Graph representation of the gene detection problem
– DNA sequence → directed acyclic graph → dynamic programming algorithm → optimal path
– Candidate exon, candidate intron, candidate gene
[Figure legend: intron, exon]
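Before any graph can be built, candidate functional sites have to be located in the sequence. The sketch below is one simple way to do that, assuming plain consensus matching on ATG (start), GT (donor), and AG (acceptor); GeneScout itself scores sites with HMMs, so this only illustrates where candidate vertices could come from. `find_candidate_sites` is a hypothetical name.

```python
def find_candidate_sites(sequence):
    """Return 0-based positions of candidate start, donor, and acceptor sites."""
    sequence = sequence.upper()
    # Consensus patterns only; a real predictor would score each hit.
    patterns = {"start": "ATG", "donor": "GT", "acceptor": "AG"}
    sites = {}
    for name, pattern in patterns.items():
        width = len(pattern)
        sites[name] = [
            i for i in range(len(sequence) - width + 1)
            if sequence[i:i + width] == pattern
        ]
    return sites
```

Each returned position would become a vertex, and pairs of compatible sites (for example a donor followed by a later acceptor) would define candidate introns and exons.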
The proposed approach – 4/4
• A dynamic programming algorithm
– Weight of the vertex v: W(v)
– Weight of the edge (v1, v2): W(v1, v2)
[Figure: example graph with start, donor, acceptor, and stop vertices]
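The search for an optimal path can be sketched as a standard longest-path dynamic program over a DAG. This is a minimal sketch, assuming the optimal path is the one that maximizes the sum of vertex weights W(v) and edge weights W(v1, v2), and that vertices are supplied in topological order (for example, sorted by sequence position); the slides do not spell out the exact recurrence, and `best_path` is a hypothetical name.

```python
def best_path(vertices, vertex_weight, edge_weight, source, sink):
    """Maximum-weight path from source to sink in a DAG.

    vertices      -- vertex names in topological order (e.g. by sequence position)
    vertex_weight -- dict: v -> W(v)
    edge_weight   -- dict: (v1, v2) -> W(v1, v2); its keys define the edges
    Returns (total_weight, path), or (None, []) if sink is unreachable.
    """
    incoming = {v: [] for v in vertices}
    for (v1, v2) in edge_weight:
        incoming[v2].append(v1)

    # best[v] = best total weight of any path from source to v
    best = {source: vertex_weight[source]}
    previous = {}
    for v in vertices:
        for u in incoming[v]:
            if u not in best:
                continue  # u is not reachable from source
            candidate = best[u] + edge_weight[(u, v)] + vertex_weight[v]
            if v not in best or candidate > best[v]:
                best[v] = candidate
                previous[v] = u
    if sink not in best:
        return None, []
    # Follow the stored predecessors back from sink to source.
    path = [sink]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[sink], path
```

Because each edge is examined once, the running time is linear in the number of vertices plus edges.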
Experiments and results – 1/3
• Data:
– GenBank: 570 vertebrate sequences (28,992,149 nucleotides) containing 2,649 exons (444,498 nucleotides)
– Start codon: ATG
– Donor site: GT
– Acceptor site: AG
• Evaluation method:
– 10-fold cross-validation
– 570 sequences → 10 sets
– 9 sets → training data
– 1 set → test data
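The cross-validation scheme can be sketched as follows. This is a minimal sketch, assuming the 570 sequences are shuffled at random before being divided into 10 sets (the slides do not say how the sets were formed); `cross_validation_splits` and its parameters are hypothetical names.

```python
import random


def cross_validation_splits(items, folds=10, seed=0):
    """Yield (training, test) lists; each item lands in the test set exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # random assignment is an assumption
    parts = [shuffled[i::folds] for i in range(folds)]
    for k in range(folds):
        test = parts[k]
        training = [x for j, part in enumerate(parts) if j != k for x in part]
        yield training, test
```

With 570 items and 10 folds, each round trains on 513 sequences and tests on the remaining 57.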
Experiments and results – 2/3
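• Evaluation measures:
– Proportion of nucleotides correctly identified
– Proportion of correctly identified nucleotides relative to those wrongly identified as such
– Overall prediction accuracy at the nucleotide level (ranging from 1 to -1)
– Proportion of exons correctly identified
– Proportion of correctly identified exons relative to those wrongly identified as exons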
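The nucleotide-level measures can be computed from a confusion matrix. The sketch below assumes they correspond to the sensitivity (Sn), specificity (Sp), and correlation coefficient (CC) conventionally used to evaluate gene predictors; the slides' own symbols did not survive, so this mapping and the name `nucleotide_accuracy` are assumptions. Returning 0.0 when a ratio is undefined is also a choice made for this sketch.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from nucleotide-level confusion counts.

    tp -- coding nucleotides predicted as coding
    fp -- non-coding nucleotides predicted as coding
    tn -- non-coding nucleotides predicted as non-coding
    fn -- coding nucleotides predicted as non-coding
    """
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sn, sp, cc
```

CC is 1 for a perfect prediction and -1 for a prediction that is wrong at every position, which matches the range given for the overall accuracy measure.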
Experiments and results – 3/3
• For 8 sequences, GeneScout correctly detected about 85% of the nucleotides, whereas GeneScan did not correctly predict any coding nucleotide
• GeneScout runs much faster than GeneScan
Conclusion
• GeneScout uses hidden Markov models to detect functional sites.
• A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path
• Experimental results show that GeneScout can detect 51% of the exons in the data set.
Comments
• Enhancing the accuracy of detection in DNA sequences:
– More models or rules
– Association rules: known exons → rules
– Rules + DNA sequences → candidate exons