GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences

Authors: Michael M. Yin and Jason T. L. Wang
Sources: Information Sciences, 163(1-3), pp. 201-218, 2004.
Advisor: Min-Shiang Hwang
Speaker: Chun-Ta Li





Outline

• Introduction

• Related work

• The proposed approach

• Experiments and results

• Conclusion

• Comments


Introduction – 1/4

• Data mining – knowledge discovery from data

• Data mining in life sciences:
  – Finding clustering rules for gene expressions
  – Discovering classification rules for proteins
  – Detecting associations between metabolic pathways
  – Predicting genes in genomic DNA sequences


Introduction – 2/4

• A genomic DNA sequence
  – Four types of nucleotides (A, C, G, T)

• The basic structure for a vertebrate gene

• A sequence fragment containing an exon of 296 nucleotides

Terms used in the figure: codon; introns; exons (coding sequences); donor.


Introduction – 3/4

[Figure: coding region]


Introduction – 4/4

• A number of programs have been developed for locating gene coding regions (exons).

• Insufficient:
  – The vertebrate DNA sequence signals involved in gene determination are usually ill defined.
  – Automated interpretation of genomic data without experimental validation is still a myth.

• Motivation:
  – GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures.
  – Exon: start sites, junction donor sites, acceptor sites


Related work – 1/2

• NN-based techniques (Neural Network)
  – Gene structure prediction
  – Training


Related work – 2/2

• HMM-based techniques (Hidden Markov Models)
  – Describe sequential data or processes
  – Use a number of states
  – Probabilistic state transitions
  – Example: casting a die

[Figure: HMM with two states, Normal and Fake]
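The die example can be made concrete with a small sketch. The numbers here are assumptions for illustration, not values from the slides: the Normal die is uniform, the Fake die rolls a six half the time, and the chance of keeping the same die between rolls is 0.9. Under those assumptions, the Viterbi algorithm recovers the most likely hidden sequence of dice from the observed rolls.

```python
import math

STATES = ("Normal", "Fake")
# Assumed parameters, chosen only for illustration.
START = {"Normal": 0.5, "Fake": 0.5}
TRANS = {
    "Normal": {"Normal": 0.9, "Fake": 0.1},
    "Fake": {"Normal": 0.1, "Fake": 0.9},
}
EMIT = {
    "Normal": {face: 1 / 6 for face in range(1, 7)},
    "Fake": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},
}

def viterbi(rolls):
    """Return the most probable hidden state path for a list of die rolls."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for roll in rolls[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][roll])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these assumed parameters, a run of ten sixes is decoded as ten Fake states, while a run of low rolls such as 1, 2, 3, 4, 5, 1, 2, 3 is decoded as all Normal. Gene-finding HMMs apply the same idea with nucleotides as the observations and sequence features as the hidden states.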


The proposed approach – 1/4

• HMM models for predicting functional sites
  – Start Site Model

[Figure: Start Site Model; labels: start codon, transition values 1 and 1]


The proposed approach – 2/4

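• An HMM model for computing coding potentials
  – The Codon Model

• If the first state is base T and the second state is base A or G, the third state can only be C or T (A and G are not defined)

Stop codons: TAA, TAG, TGA (TGG encodes tryptophan and is not a stop codon, but the rule above excludes it as well)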
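The state constraints on this slide amount to a filter over the 64 possible codons. The following is a minimal sketch that reads the three constraints as one conditional rule; that reading is an interpretation of the slide, and the function name is hypothetical rather than taken from GeneScout.

```python
from itertools import product

BASES = "ACGT"

def allowed_by_codon_rule(codon):
    """Slide rule (as interpreted here): after T then A or G, only C or T may follow."""
    first, second, third = codon
    if first == "T" and second in "AG":
        return third in "CT"
    return True

ALL_CODONS = ["".join(c) for c in product(BASES, repeat=3)]
ALLOWED = [c for c in ALL_CODONS if allowed_by_codon_rule(c)]
EXCLUDED = sorted(set(ALL_CODONS) - set(ALLOWED))

print(len(ALLOWED), EXCLUDED)
```

The rule leaves 60 codons and excludes TAA, TAG, TGA, and TGG. The first three are the stop codons of the standard genetic code; TGG is excluded only as a side effect of the rule being this simple.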


The proposed approach – 3/4

• Graph representation of the gene detection problem
  – DNA sequence → directed acyclic graph → dynamic programming algorithm → optimal path
  – Candidate exon, candidate intron, candidate gene

[Figure: graph representation, with a legend for intron and exon]
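To illustrate where the candidates come from, here is a hedged sketch that scans a sequence for the consensus signals named in the experiments section (ATG for start, GT for donor, AG for acceptor) plus the standard stop codons, then pairs every donor with every later acceptor. Exhaustive pairing is a simplification assumed for this example; the slides say GeneScout scores functional sites with HMMs, and the function names below are hypothetical.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

def find_sites(seq):
    """Return candidate site positions (0-based index of each signal's first base)."""
    sites = {"start": [], "donor": [], "acceptor": [], "stop": []}
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "GT":
            sites["donor"].append(i)
        if seq[i:i + 2] == "AG":
            sites["acceptor"].append(i)
        if seq[i:i + 3] == "ATG":
            sites["start"].append(i)
        if seq[i:i + 3] in STOP_CODONS:
            sites["stop"].append(i)
    return sites

def candidate_introns(sites):
    """Pair every donor with every later acceptor: one candidate intron per pair."""
    return [(d, a) for d in sites["donor"] for a in sites["acceptor"] if d < a]
```

For the toy sequence "ATGGTAAGTTAG", `find_sites` reports a start at 0, donors at 3 and 7, acceptors at 6 and 10, and stops at 4 and 9, which gives three candidate introns. Each site becomes a vertex of the graph, and each candidate exon or intron becomes an edge between two sites.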


The proposed approach – 4/4

• A dynamic programming algorithm
  – Weight of the vertex v: W(v)
  – Weight of the edge (v1, v2): W(v1, v2)

[Figure: graph with vertices labeled start, donor, acceptor, and stop]
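The slide defines the vertex and edge weights but does not show the recurrence. The sketch below assumes that a path's score is the sum of its vertex and edge weights and that the optimal path is the one with the highest score. It is a generic longest-path computation on a DAG, not necessarily GeneScout's exact algorithm, and all names are hypothetical.

```python
def best_path(vertices, edges, vertex_weight, edge_weight, source, sink):
    """Highest-weight path from source to sink in a DAG.

    vertices must be listed in topological order (e.g. by sequence position).
    edges maps each vertex to the vertices it points to.
    """
    best = {v: float("-inf") for v in vertices}
    prev = {}
    best[source] = vertex_weight[source]
    for v in vertices:
        if best[v] == float("-inf"):
            continue  # not reachable from the source
        for w in edges.get(v, ()):
            score = best[v] + edge_weight[(v, w)] + vertex_weight[w]
            if score > best[w]:
                best[w], prev[w] = score, v
    if best[sink] == float("-inf"):
        return None, float("-inf")
    # Follow the predecessor links back from the sink.
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[sink]
```

Because the vertices are processed in topological order, each vertex's best score is final before its outgoing edges are relaxed, so the running time is linear in the number of vertices and edges.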


Experiments and results – 1/3

• Data:
  – GenBank: 570 vertebrate sequences (28,992,149 nucleotides) containing 2649 exons (444,498 nucleotides)
  – Start codon: ATG
  – Donor site: GT
  – Acceptor site: AG

• Evaluation method:
  – 10-way cross-validation
  – 570 sequences → 10 sets: 9 sets as training data, 1 set as test data
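The 10-way split can be sketched as follows. The slides do not say how sequences were assigned to sets, so the shuffle, the seed, and the function name are assumptions for illustration.

```python
import random

def ten_fold_splits(items, k=10, seed=0):
    """Yield (training, test) lists; each of the k sets is the test set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test
```

For 570 sequences this yields ten rounds, each with 513 training sequences and 57 test sequences, and every sequence is tested exactly once.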


Experiments and results – 2/3

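• Proportion of nucleotides correctly identified
• Proportion of nucleotides correctly identified relative to those wrongly identified as nucleotides
• Overall prediction accuracy at the nucleotide level (ranging from 1 to -1)
• Proportion of exons correctly identified
• Proportion of exons correctly identified relative to those wrongly identified as exons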
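The symbols for these measures did not survive in the transcript. The sketch below assumes they are the definitions commonly used in gene-prediction evaluation: sensitivity TP/(TP+FN), specificity TP/(TP+FP), and a correlation coefficient that ranges from -1 to 1. This mapping matches the descriptions on the slide but is an assumption, and the function name is hypothetical.

```python
import math

def accuracy_measures(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient) from counts."""
    sensitivity = tp / (tp + fn)   # share of true coding units that were found
    specificity = tp / (tp + fp)   # share of predicted coding units that are right
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denominator
    return sensitivity, specificity, correlation
```

The same function works at the nucleotide level or the exon level, depending on what is counted. It does not guard against zero denominators, which occur when a test set has no true or no predicted coding units.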


Experiments and results – 3/3

• For 8 sequences, GeneScout correctly detected about 85% of the nucleotides, whereas GeneScan did not correctly predict any coding nucleotide

• GeneScout runs much faster than GeneScan


Conclusion

• GeneScout uses hidden Markov models to detect functional sites.

• A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path

• Experimental results show that GeneScout can detect 51% of exons in the data set.


Comments

• Enhancing the accuracy of detection in DNA sequences:
  – More models or rules
  – Association rules: known exons → rules
  – Rules → DNA sequences → candidate exons