Page 1: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel

Dr. Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
robertas.damasevicius@ktu.lt

Page 2: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain 2

What is splicing?

Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined

Splice junctions: boundary points between exons and introns where splicing occurs

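Donor: upstream (5′) end of the intron, conserved dinucleotide GT
Acceptor: downstream (3′) end of the intron, conserved dinucleotide AG
Pseudo splice sites: GT or AG dinucleotides that are not actual splice sites

Diagram: …CGATAA AG | ATC..AAT | GT ATCGCA…
(an intron ending in the acceptor AG, then the exon, then an intron starting with the donor GT; each intron/exon boundary is a splice-junction site)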
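As a rough illustration of why pseudo splice sites are a problem, the sketch below lists every GT and AG dinucleotide in a sequence as a candidate site. The function name `candidate_sites` and the toy string (the diagram fragment with its elision dropped) are assumptions made for this example; they are not part of the presentation.

```python
def candidate_sites(seq):
    """Return 0-based positions of every GT (donor-like) and AG (acceptor-like) dinucleotide."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Toy sequence: the fragment from the diagram, written without the elision.
donors, acceptors = candidate_sites("CGATAAAGATCAATGTATCGCA")
print(donors, acceptors)  # [14] [6]
```

On real genomic sequence such a scan returns far more candidates than true splice sites, which is why a classifier is needed to separate real sites from pseudo sites.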

Page 3: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Problem

Splice-junction site recognition is important for:
  successful gene prediction
  study of genetic diseases
  understanding of genetic mechanisms

Difficulties:
  noisy data
  pseudo splice sites
  non-canonical splice sites (intron is not GT...AG)
  alternative splicing
  multitude of consensus sequences

Machine learning: Support Vector Machine (SVM)
  Feature space mapping for SVM
  Which frequency-based feature mapping is the best?

Page 4: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Support Vector Machine (SVM)

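$$g(x_j) = \operatorname{sgn}\left( \sum_{x_i \in SV} \alpha_i y_i K(x_i, x_j) + b \right)$$

$x_i \in X$ are training data vectors, $x_j \in X$ are unknown data vectors

$y_i \in Y$, where $Y = \{-1, 1\}$ is a target space

$K(x_i, x_j)$ is the kernel function.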
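The decision rule of a trained SVM can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the support vectors, their coefficients, their labels and the bias are already known; the names `svm_decision` and `linear_kernel` and all numeric values are made up for the example.

```python
def linear_kernel(u, v):
    """Plain dot product, used here only as a stand-in kernel."""
    return sum(ui * vi for ui, vi in zip(u, v))


def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """Sign of the kernel expansion over the support vectors plus the bias."""
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    # Assumption: a score of exactly zero is assigned to the positive class.
    return 1 if s + b >= 0 else -1


# Toy model with two support vectors (hypothetical values).
svs = [(1.0, 0.0), (0.0, 1.0)]
alphas = [1.0, 1.0]
labels = [1, -1]

print(svm_decision((0.8, 0.2), svs, alphas, labels, 0.0, linear_kernel))  # 1
print(svm_decision((0.1, 0.9), svs, alphas, labels, 0.0, linear_kernel))  # -1
```

Any function of two vectors can be passed as `kernel`, which is the point where a different kernel function would plug in.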

Page 5: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

What factors influence quality of classification?

Training data: size of dataset, generation of negative examples, imbalanced datasets
Mapping of data into feature space: orthogonal, single nucleotide, nucleotide grouping, ...
Selection of an optimal kernel function: linear, polynomial, RBF, sigmoid
Kernel parameters
SVM learning parameters: regularization parameter, cost factor

Page 6: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

SVM feature space

Feature space: multidimensional vector representing data instances

Mapping of data into features: achieving better classification accuracy

Feature space construction:
  nucleotide position-dependent
  nucleotide position-independent
  both nucleotide position-dependent and -independent information

Feature mapping rule:

$$M: \hat{S} \to \hat{F}, \quad \hat{S} = (s_1, s_2, \ldots, s_N), \quad \hat{F} = (f_1, f_2, \ldots, f_M)$$

N – the length of a DNA sequence, M – the length of feature vector

Page 7: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

K-mers

K-mer: a k-base long sequence (k-tuple) of DNA

K-mer feature vector: constructed using a frequency (or probability) of each k-mer in a DNA sequence

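K-mer: $a_1 a_2 \ldots a_k$, $a_i \in \Sigma$, $i = 1, 2, \ldots, k$

K-mer frequency:

$$p_j(\hat{S}) = \frac{n_j}{N - k + 1}, \quad j = 1, \ldots, \|\Sigma\|^k$$

Σ – alphabet, N – length of a DNA sequence $\hat{S}$, k – length of k-mer, $n_j$ – number of occurrences of the j-th k-mer in the DNA sequence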
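A k-mer frequency vector can be computed with a sliding window. The sketch below is one possible reading of the rule, assuming overlapping windows and a denominator of N − k + 1 (which matches the fifths in the 2-mer example for the 6-base sequence AAAGTC); the function name `kmer_frequencies` is hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return (kmers, freqs): all len(alphabet)**k k-mers and their relative frequencies."""
    windows = len(seq) - k + 1  # number of overlapping k-long windows, N - k + 1
    if windows < 1:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(windows):
        window = seq[i:i + k]
        if window in counts:  # windows with symbols outside the alphabet are skipped
            counts[window] += 1
    return kmers, [counts[m] / windows for m in kmers]


kmers, freqs = kmer_frequencies("AAAGTC", 2)
nonzero = {m: f for m, f in zip(kmers, freqs) if f > 0}
print(nonzero)  # {'AA': 0.4, 'AG': 0.2, 'GT': 0.2, 'TC': 0.2}
```

The vector has a fixed length of `len(alphabet) ** k` regardless of the sequence length, so sequences of different lengths map into the same feature space.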

Page 8: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

K-mer frequency mapping rules

4-letter (ACGT): Σ = {A, C, G, T}, ||Σ|| = 4
  Disadvantage: feature space grows as ~ $4^k$

Nucleotide grouping based: SW, KM & RY

SW: Σ = {S, W}, ||Σ|| = 2
  Strong (C, G) nucleotides – 3 H bonds
  Weak (A, T) nucleotides – 2 H bonds

RY: Σ = {R, Y}, ||Σ|| = 2
  A and G – purines (R)
  C and T – pyrimidines (Y)

KM: Σ = {K, M}, ||Σ|| = 2
  A and C – amines (M)
  G and T – ketones (K)
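The three grouping rules amount to a letter-by-letter recoding of the sequence into a two-symbol alphabet. A minimal sketch, with the table name `GROUPINGS` and the function `recode` chosen for this example only:

```python
# Two-letter recodings of the nucleotide alphabet, as listed on the slide.
GROUPINGS = {
    "SW": {"C": "S", "G": "S", "A": "W", "T": "W"},  # strong / weak
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},  # purine / pyrimidine
    "KM": {"G": "K", "T": "K", "A": "M", "C": "M"},  # keto / amino
}


def recode(seq, rule):
    """Rewrite a DNA sequence in the two-letter alphabet of the given rule."""
    table = GROUPINGS[rule]
    return "".join(table[base] for base in seq.upper())


for rule in ("SW", "RY", "KM"):
    print(rule, recode("AAAGTC", rule))
# SW WWWSWS
# RY RRRRYY
# KM MMMKKM
```

K-mer frequencies are then counted on the recoded string, so the feature vector has $2^k$ entries instead of $4^k$.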

Page 9: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Example: 2-mer frequency mapping

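Mapping rule: ACGT
  Sequence: AAAGTC
  2-mers: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
  Feature vector: (2/5, 0, 1/5, 0, 0, 0, 0, 0, 0, 0, 0, 1/5, 0, 1/5, 0, 0)

Mapping rule: SW
  Sequence: WWWSWS
  2-mers: SS, SW, WS, WW
  Feature vector: (0, 1/5, 2/5, 2/5)

Mapping rule: RY
  Sequence: RRRRYY
  2-mers: RR, RY, YR, YY
  Feature vector: (3/5, 1/5, 0, 1/5)

Mapping rule: KM
  Sequence: MMMKKM
  2-mers: KK, KM, MK, MM
  Feature vector: (1/5, 1/5, 1/5, 2/5)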

Page 10: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Case study

Dataset: UCI repository, Genbank 64.1 primate data
  3175 sequences, each spanning (-30 bp, +30 bp) around the splice site

Three splice site recognition sub-problems:
  Exon/Intron (EI) vs. Negative (N)
  Intron/Exon (IE) vs. Negative (N)
  Exon/Intron (EI) vs. Intron/Exon (IE)

Three datasets:
  EI vs. N: 767 EI and 1655 N
  IE vs. N: 768 IE and 1655 N
  EI vs. IE: 767 EI and 768 IE

Kernel: power series kernel
Accuracy evaluation metric: F-measure

Page 11: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Classification results: Exon/Intron vs. Negative

Page 12: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Classification results: Intron/Exon vs. Negative

Page 13: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Classification results: Intron/Exon vs. Exon/Intron

Page 14: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Classification time

Page 15: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Feature vector size

Intron/exon splice sites, 2422 sequences

Page 16: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Evaluation of results

Classification accuracy:
  Exon/Intron vs. N: 4-mer ACGT frequency mapping (78.05%)
  Intron/Exon vs. N: 6-mer ACGT frequency mapping (70.75%)
  E/I vs. I/E: 6-mer ACGT frequency mapping (90.59%)
  4-mers and 6-mers are better than 5-mers
  RY is always better than SW or KM

Feature space size:
  ACGT k-mer: $4^k$
  SW, RY, KM k-mer: $2^k$

Classification speed:
  SW/KM/RY k-mer frequency based classification can be ~2 times faster than ACGT k-mer classification
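The gap between the two feature space sizes is easy to tabulate. The short sketch below only evaluates the two exponentials for small k; the helper name `feature_space_size` is an assumption of this example.

```python
def feature_space_size(alphabet_size, k):
    """Number of distinct k-mers over an alphabet, i.e. the feature vector length."""
    return alphabet_size ** k


for k in range(1, 7):
    print(k, feature_space_size(4, k), feature_space_size(2, k))
# 1 4 2
# 2 16 4
# 3 64 8
# 4 256 16
# 5 1024 32
# 6 4096 64
```

At k = 6 a grouped alphabet needs 64 features where the 4-letter alphabet needs 4096.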

Page 17: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

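Why is RY better than SW or KM?

Rule: ACGT
  Donor (EI) consensus: (C|A)AG / GT(A|G)AGT
  Acceptor (IE) consensus: (C|T)_n N (C|T)AG / G

Rule: SW
  Donor (EI) consensus: (S|W)WS / SW(S|W)WSW
  Acceptor (IE) consensus: (S|W)_n N (S|W)WS / S

Rule: KM
  Donor (EI) consensus: MMK / KK(K|M)MKK
  Acceptor (IE) consensus: (K|M)_n N (K|M)MK / K

Rule: RY
  Donor (EI) consensus: (R|Y)RR / RYRRRY
  Acceptor (IE) consensus: Y_n N YRR / R

(subscript n: the unit is repeated n times; N: any nucleotide)

Acceptor consensus sequence has long runs of pyrimidines (Y)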
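The effect of a pyrimidine run can be checked on a toy sequence. In the sketch below a made-up pyrimidine-rich stretch ending in AG is recoded under each rule and the longest run of identical symbols is measured; the sequence `tract` and the helper `longest_run` are hypothetical and only illustrate the argument.

```python
from itertools import groupby

# Translation tables for the three groupings, in the order A, C, G, T.
RULES = {
    "RY": str.maketrans("ACGT", "RYRY"),
    "SW": str.maketrans("ACGT", "WSSW"),
    "KM": str.maketrans("ACGT", "MMKK"),
}


def longest_run(symbols):
    """Length of the longest stretch of one repeated symbol."""
    return max(len(list(group)) for _, group in groupby(symbols))


tract = "TTCTCCTTTCAG"  # hypothetical pyrimidine-rich stretch followed by AG
for rule, table in RULES.items():
    coded = tract.translate(table)
    print(rule, coded, longest_run(coded))
# RY YYYYYYYYYYRR 10
# SW WWSWSSWWWSWS 3
# KM KKMKMMKKKMMK 3
```

Only the RY recoding turns the stretch into one uniform run, which would concentrate the k-mer frequency mass on the all-Y k-mers; SW and KM split the same stretch into short alternating pieces.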

Page 18: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Conclusions

Selection of the appropriate feature mapping rule can greatly influence the DNA sequence classification results
Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules
For a trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency mapping
Open research problem: “forbidden” k-mers

Page 19: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Questions?

Page 20: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

SVM kernel function optimization

Introduction of additional kernel parameters
Introduction of new kernels

Power series kernel function:

$$K_n(x_i, x_j) = \sum_{k=1}^{n} a_k \left( x_i x_j^T + c \right)^k$$

Advantage: more parameters for optimization, better separation of classes in feature space
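A power series kernel can be sketched as a weighted sum of powers of a shifted dot product. The code below assumes the reading $K_n(x_i, x_j) = \sum_{k=1}^{n} a_k (x_i x_j^T + c)^k$; the placement of the offset c inside the power is an assumption of this sketch, and the function name `power_series_kernel` and all numeric values are illustrative.

```python
def power_series_kernel(x_i, x_j, coeffs, c=0.0):
    """Sum of a_k * (x_i . x_j + c)**k for k = 1..n, with coeffs = [a_1, ..., a_n]."""
    # Assumption: the offset c is added to the dot product before raising to the power.
    base = sum(u * v for u, v in zip(x_i, x_j)) + c
    return sum(a_k * base ** k for k, a_k in enumerate(coeffs, start=1))


x, y = (1.0, 2.0), (3.0, 4.0)
print(power_series_kernel(x, y, [1.0]))               # 11.0, the linear kernel
print(power_series_kernel(x, y, [0.0, 1.0], c=1.0))   # 144.0, polynomial kernel of degree 2
print(power_series_kernel(x, y, [1.0, 0.5], c=1.0))   # 84.0
```

Under this reading the linear and polynomial kernels are special cases, and each coefficient a_k is one more parameter available for tuning.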

Page 21: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

SW k-mer frequency mapping rule

The SW mapping rule ({A,T} vs. {C,G})
  reflects the difference in the number of hydrogen bonds in the DNA molecule
    Strong (C, G) nucleotides - 3 H bonds
    Weak (A, T) nucleotides - 2 H bonds
  is related to physical-chemical properties of DNA
    transport of electrons
    mechanical waves along the DNA helix

$$p_j(\hat{S}) = \frac{n_j}{N - k + 1}, \quad j = 1, \ldots, 2^k, \quad \Sigma = \{S, W\}$$

Page 22: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

RY k-mer frequency mapping rule

The RY mapping rule ({A, G} vs. {C, T})
  describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence
    A and G – purines (R)
    C and T – pyrimidines (Y)
  corresponds to the chemical composition bias in the DNA strand

$$p_j(\hat{S}) = \frac{n_j}{N - k + 1}, \quad j = 1, \ldots, 2^k, \quad \Sigma = \{R, Y\}$$

Page 23: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

KM k-mer mapping rule

The KM mapping rule ({A,C} vs. {G,T})
  describes how ketones (K) and amines (M) are distributed along the DNA sequence
    A and C – amines (M)
    G and T – ketones (K)

$$p_j(\hat{S}) = \frac{n_j}{N - k + 1}, \quad j = 1, \ldots, 2^k, \quad \Sigma = \{K, M\}$$

Page 24: Dr.  Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Classification metric

F-measure

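Advantage: one measure that takes into account both recall and precision: a spectacular score in one does not compensate for a bad score in the other

$$F = \frac{2 \cdot precision \cdot recall}{precision + recall} \cdot 100\%$$

$$precision = \frac{n_{TP}}{n_{TP} + n_{FN}}, \qquad recall = \frac{n_{TN}}{n_{TN} + n_{FP}}$$

$n_{TP}$, $n_{FN}$, $n_{TN}$, $n_{FP}$ – numbers of true positives, false negatives, true negatives and false positives (in the more common terminology these two ratios are called sensitivity and specificity)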
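The claim that a high score in one component cannot compensate for a low score in the other follows from the harmonic-mean form of the F-measure. A minimal sketch, taking precision and recall as already computed fractions in [0, 1]; the function name `f_measure` and the sample values are assumptions of this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, expressed in percent."""
    if precision + recall == 0:
        return 0.0  # convention assumed here for the degenerate case
    return 2 * precision * recall / (precision + recall) * 100.0


print(f_measure(0.5, 0.5))                # 50.0
print(round(f_measure(0.99, 0.10), 2))    # 18.17
```

With precision 0.99 and recall 0.10 the arithmetic mean would be 54.5%, while the F-measure stays near 18%, close to the weaker of the two scores.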