genomic sequence analysis using electron-ion interaction potential

Genomic Sequence Analysis using Electron-Ion Interaction Potential Masumi Kobayashi Performance Evaluation Laboratory University of Aizu

Upload: phiala

Post on 19-Jan-2016


Page 1: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Genomic Sequence Analysis using Electron-Ion Interaction Potential

Masumi Kobayashi Performance Evaluation Laboratory

University of Aizu

Page 2: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Purpose

To find gene regions by using the Lindley equation and the Electron-Ion Interaction Potential (EIIP).

To judge the similarity of two DNA sequences with a shorter processing time by using the Lindley equation and the Electron-Ion Interaction Potential (EIIP).

Page 3: Genomic Sequence Analysis using Electron-Ion Interaction Potential

DNA

A DNA sequence consists of four nucleotide letters: A (adenine), T (thymine), G (guanine), and C (cytosine).

Base A is always paired with base T, and base C is always paired with base G. DNA forms a double helix.

Page 4: Genomic Sequence Analysis using Electron-Ion Interaction Potential

DNA Sequence and Amino Acid Sequence

A DNA sequence consists of a row of the four nucleotides, and each nucleotide triplet is called a codon. Each codon corresponds to an amino acid.

DNA Sequence        | ・・・ |ATG|CGA|TAT|AAA|GCT|TTC| ・・・ |
Amino Acid Sequence | ・・・ | M | R | Y | K | A | F | ・・・ |

Each three-letter group in the DNA sequence is one codon.

Page 5: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Codon

61 codons are translated into amino acids. For example, both TTT and TTC code for phenylalanine (F). The 3 codons TAA, TAG, and TGA are called stop codons.

Codon  AminoAcid   Codon  AminoAcid   Codon  AminoAcid   Codon  AminoAcid
TTT    F           TCT    S           TAT    Y           TGT    C
TTC    F           TCC    S           TAC    Y           TGC    C
TTA    L           TCA    S           TAA    STOP        TGA    STOP
TTG    L           TCG    S           TAG    STOP        TGG    W
CTT    L           CCT    P           CAT    H           CGT    R
CTC    L           CCC    P           CAC    H           CGC    R
CTA    L           CCA    P           CAA    Q           CGA    R
CTG    L           CCG    P           CAG    Q           CGG    R
ATT    I           ACT    T           AAT    N           AGT    S
ATC    I           ACC    T           AAC    N           AGC    S
ATA    I           ACA    T           AAA    K           AGA    R
ATG    M           ACG    T           AAG    K           AGG    R
GTT    V           GCT    A           GAT    D           GGT    G
GTC    V           GCC    A           GAC    D           GGC    G
GTA    V           GCA    A           GAA    E           GGA    G
GTG    V           GCG    A           GAG    E           GGG    G
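The codon-to-amino-acid mapping can be sketched in a few lines of Python. This is only an illustration: the names `CODON_TABLE` and `translate` are hypothetical, the table is the standard genetic code, and stop codons are written as `*`.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed for codons TTT, TTC, TTA, TTG, TCT, ...
# (first base changes slowest, third base fastest, each in the order T, C, A, G).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

print(translate("ATGCGATATAAAGCTTTC"))
```

The table has 64 entries, of which 61 map to amino acids and 3 (`TAA`, `TAG`, `TGA`) map to `*`.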

Page 6: Genomic Sequence Analysis using Electron-Ion Interaction Potential

The waiting time of the customer of queuing theory and a DNA sequence

In order to use the Lindley equation, we need to describe the relation between the waiting time of a customer in queuing theory and a DNA sequence.

A score is given for the similarity of the amino acids of the two target gene sequences, and the sum of the scores is made to correspond to the waiting time in queuing theory.

Page 7: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Lindley Equation

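$S_n$: The score of the n-th letter.

$W_n$: The sum of the scores up to the n-th letter.

$$W_n = \max\{W_{n-1} + S_n,\ 0\}$$

$$W_n = \max_{1 \le i \le n+1} \left\{ \sum_{k=i}^{n} S_k \right\}$$

In the second form, the empty sum for $i = n+1$ equals 0.

[Figure: the amino acid sequence F L I ... M V S T with scores $S_1, S_2, S_3, \dots$ and sums $W_1, W_2, W_3, \dots$; when $W_{n-1} + S_n$ takes a negative value, $W_n = 0$.]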
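As a minimal sketch, the Lindley recursion can be computed in a single pass over the scores. The function names below are hypothetical, and the start value W_0 = 0 is an assumption. The second function evaluates the maximum-over-partial-sums form directly, only to show that both forms give the same values.

```python
def lindley(scores):
    """Return [W_1, ..., W_n] with W_0 = 0 (assumed) and W_n = max(W_{n-1} + S_n, 0)."""
    w, out = 0.0, []
    for s in scores:
        w = max(w + s, 0.0)
        out.append(w)
    return out

def lindley_max_form(scores):
    """Same values from W_n = max over 1 <= i <= n+1 of sum_{k=i}^{n} S_k (empty sum = 0)."""
    out = []
    for n in range(1, len(scores) + 1):
        # scores[i:n] with 0-based i covers S_{i+1}..S_n; i == n gives the empty sum.
        out.append(max(sum(scores[i:n]) for i in range(0, n + 1)))
    return out

example = [0.5, -1.0, 0.25, 0.25, -0.125]
print(lindley(example))
print(lindley_max_form(example))
```

The recursive form needs one addition and one comparison per letter, whereas the direct form is quadratic in the sequence length.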

Page 8: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Electron-Ion Interaction Potential (EIIP)

Prof. Toyoizumi and Tuchiya showed a technique to find gene coding regions by using the Lindley equation. However, the scores required for the Lindley equation were determined artificially.

In this research, we determine the scores theoretically by using the Electron-Ion Interaction Potential. Each amino acid is represented by its EIIP value, which describes the average energy state of all valence electrons in that amino acid.

AminoAcid  EIIP       AminoAcid  EIIP
Leu(L)     0          Tyr(Y)     0.0516
Ile(I)     0          Trp(W)     0.0548
Asn(N)     0.0036     Gln(Q)     0.0761
Gly(G)     0.005      Met(M)     0.0823
Val(V)     0.0057     Ser(S)     0.0829
Glu(E)     0.0058     Cys(C)     0.0829
Pro(P)     0.0198     Thr(T)     0.0941
His(H)     0.0242     Phe(F)     0.0946
Lys(K)     0.0371     Arg(R)     0.0959
Ala(A)     0.0373     Asp(D)     0.1263

Page 9: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Gene Finding Experiment

The target sequence of this experiment is the genome data of Escherichia coli O157:H7 Sakai.

Escherichia coli O157:H7 Sakai is a major food-borne pathogen that causes diarrhea, colitis, and hemolytic uremic syndrome.

We calculate $W_n$ using the Lindley equation and EIIP.

Page 10: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Example of Amino Acid Scores and the Stop Codon Score (1)

Score = EIIP - 0.0885

Stop Codon Score = -2 × 0.0885 = -0.177

AminoAcid  Score       AminoAcid  Score
Leu(L)     -0.0885     Tyr(Y)     -0.0369
Ile(I)     -0.0885     Trp(W)     -0.0337
Asn(N)     -0.0849     Gln(Q)     -0.0124
Gly(G)     -0.0835     Met(M)     -0.0062
Val(V)     -0.0828     Ser(S)     -0.0056
Glu(E)     -0.0827     Cys(C)     -0.0056
Pro(P)     -0.0687     Thr(T)      0.0056
His(H)     -0.0643     Phe(F)      0.0061
Lys(K)     -0.0514     Arg(R)      0.0074
Ala(A)     -0.0512     Asp(D)      0.0378

Negative Score: Leu through Cys. Positive Score: Thr, Phe, Arg, Asp.
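The scoring rule of this example (each score is the EIIP value minus a fixed offset, and the stop codon gets a negative multiple of the offset) can be sketched as follows. The names `EIIP`, `score_table`, and `stop_factor` are hypothetical, and the value 0.0516 for Tyr is an assumption chosen because it is the value that reproduces the Tyr entries of the score tables.

```python
# EIIP values per amino acid (one-letter codes).
# Assumption: Tyr = 0.0516, the value implied by the Tyr scores in the score tables.
EIIP = {
    "L": 0.0,    "I": 0.0,    "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}

def score_table(offset, stop_factor=2.0):
    """Score = EIIP - offset; the stop codon '*' scores -stop_factor * offset."""
    table = {aa: round(v - offset, 4) for aa, v in EIIP.items()}
    table["*"] = round(-stop_factor * offset, 4)
    return table

scores = score_table(0.0885)
for aa in "LNDYT*":
    print(aa, scores[aa])
```

A larger offset makes more amino acids score negatively, so the offset controls how quickly W_n falls back to 0 in regions that are not gene coding regions.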

Page 11: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Example of Amino Acid Scores and the Stop Codon Score (2-1)

Score = EIIP - 0.0445

Stop Codon Score = -2 × 0.0445 = -0.089

AminoAcid  Score       AminoAcid  Score
Leu(L)     -0.0445     Tyr(Y)      0.0071
Ile(I)     -0.0445     Trp(W)      0.0103
Asn(N)     -0.0409     Gln(Q)      0.0316
Gly(G)     -0.0395     Met(M)      0.0378
Val(V)     -0.0388     Ser(S)      0.0384
Glu(E)     -0.0387     Cys(C)      0.0384
Pro(P)     -0.0247     Thr(T)      0.0496
His(H)     -0.0203     Phe(F)      0.0501
Lys(K)     -0.0074     Arg(R)      0.0514
Ala(A)     -0.0072     Asp(D)      0.0818

Negative Score: Leu through Ala. Positive Score: Tyr through Asp.

Page 12: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Example of Amino Acid Scores and the Stop Codon Score (2-2)

Change the Stop Codon Score: -0.089 → -0.178 (-4 × 0.0445)

AminoAcid  Score       AminoAcid  Score
Leu(L)     -0.0445     Tyr(Y)      0.0071
Ile(I)     -0.0445     Trp(W)      0.0103
Asn(N)     -0.0409     Gln(Q)      0.0316
Gly(G)     -0.0395     Met(M)      0.0378
Val(V)     -0.0388     Ser(S)      0.0384
Glu(E)     -0.0387     Cys(C)      0.0384
Pro(P)     -0.0247     Thr(T)      0.0496
His(H)     -0.0203     Phe(F)      0.0501
Lys(K)     -0.0074     Arg(R)      0.0514
Ala(A)     -0.0072     Asp(D)      0.0818

Stop Codon Score = -0.178

Page 13: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Threshold of Amino Acid Sequence

$W_n$ may become high by chance in a region of an amino acid sequence that is meaningless.

The threshold is used to distinguish gene coding regions from such meaningless regions.

The score sequence $S_n$ of an amino acid sequence is assumed to be independent and identically distributed.

Under this assumption, $W_n$ can be considered to be the waiting time of a GI/GI/1 queuing system.

Page 14: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Threshold and the Probability of Exceeding the Threshold by Chance

The waiting time $W_n$ of a GI/GI/1 queuing system satisfies the following inequality:

$$P[W_n \ge x] \le e^{-\theta x}, \qquad \theta = \sup\{\, s > 0 : E[e^{s S_n}] \le 1 \,\}$$

For any $0 < p < 1$, set $w_0 = -\log p / \theta$; then

$$P[W_n \ge w_0] \le p$$

$p$ is the probability that a sequence is judged to be meaningful although it is meaningless.

The probability $p$ that $W_n$ will exceed $w_0$ (the threshold) by chance is 0.05.
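To make the threshold concrete, here is a minimal sketch that finds the decay rate (written `theta` here, a symbol chosen for this illustration) by bisection for a discrete score distribution and then derives the threshold w_0 for a given p. The function names and the two-point example distribution are hypothetical; in the experiments the distribution would come from the amino acid scores and their frequencies, which are not given in the slides.

```python
import math

def decay_rate(values, probs):
    """theta = sup{s > 0 : E[exp(s*S)] <= 1} for a discrete score distribution.

    Requires E[S] < 0 and at least one positive score; otherwise no finite
    positive theta exists.
    """
    mean = sum(p * v for v, p in zip(values, probs))
    if mean >= 0 or max(values) <= 0:
        raise ValueError("need a negative mean score and some positive score")

    def mgf(s):
        return sum(p * math.exp(s * v) for v, p in zip(values, probs))

    lo, hi = 0.0, 1.0
    while mgf(hi) <= 1.0:        # grow the bracket until E[exp(hi*S)] > 1
        hi *= 2.0
    for _ in range(200):         # bisection; invariant: mgf(lo) <= 1 < mgf(hi)
        mid = 0.5 * (lo + hi)
        if mgf(mid) <= 1.0:
            lo = mid
        else:
            hi = mid
    return lo

def threshold(values, probs, p=0.05):
    """w_0 = -log(p) / theta, so that the bound exp(-theta * w_0) equals p."""
    return -math.log(p) / decay_rate(values, probs)

# Hypothetical two-point distribution: score +1 with prob. 0.25, -1 with prob. 0.75.
print(decay_rate([1.0, -1.0], [0.25, 0.75]))
print(threshold([1.0, -1.0], [0.25, 0.75], p=0.05))
```

For this two-point example the equation E[exp(s*S)] = 1 can be solved by hand and gives theta = log 3, which is a convenient check on the bisection.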

Page 15: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Distinction of gene coding regions and junk regions by Threshold

Page 16: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Similarity Comparison Experiment

The target sequences of this experiment are the genome data of human α- and β-hemoglobins.

Hemoglobin is contained in erythrocytes and consists of a "heme", which contains iron, and a "globin", which is a protein. It has the important role of carrying oxygen inside the body.

We calculate $W_n$ using the Lindley equation and EIIP.

Page 17: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Sequences of Human α- and β-Hemoglobins

The genome data that we use are the gene coding regions of human α- and β-hemoglobins.

A gene coding region of Human β-Hemoglobin

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

A gene coding region of Human α-Hemoglobin

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Page 18: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Amino Acid and the Stop Codon Scores

Score = EIIP - 0.0532

Stop Codon Score = -2 × 0.0532 = -0.1064

AminoAcid  Score       AminoAcid  Score
Leu(L)     -0.0532     Tyr(Y)     -0.0016
Ile(I)     -0.0532     Trp(W)      0.0016
Asn(N)     -0.0496     Gln(Q)      0.0229
Gly(G)     -0.0482     Met(M)      0.0291
Val(V)     -0.0475     Ser(S)      0.0297
Glu(E)     -0.0474     Cys(C)      0.0297
Pro(P)     -0.0334     Thr(T)      0.0409
His(H)     -0.029      Phe(F)      0.0414
Lys(K)     -0.0161     Arg(R)      0.0427
Ala(A)     -0.0159     Asp(D)      0.0731

Page 19: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Calculation Results of $W_n$ in α-Hemoglobin and β-Hemoglobin

[Figure: plots of $W_n$, one panel for each hemoglobin.]

Page 20: Genomic Sequence Analysis using Electron-Ion Interaction Potential

The difference (absolute value) of the calculation results of $W_n$ in α-Hemoglobin and β-Hemoglobin

[Figure: absolute difference of $W_n$ (vertical axis, 0 to 0.3) against position in the sequence (horizontal axis, 1 to 137), with two series.]

The difference of α- and β-hemoglobins: the average is 0.03874.

The difference of α-hemoglobin and a random-sequence β-hemoglobin: the average is 0.061567.
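The comparison can be sketched as below: compute W_n for each sequence with Score = EIIP - 0.0532, then average the absolute difference position by position. This is an illustration under stated assumptions, not the presentation's exact procedure: the names are hypothetical, W_0 = 0 is assumed, Tyr is taken as 0.0516 (the value implied by the score tables), and the longer sequence is simply truncated to the length of the shorter one, so the printed value is not claimed to reproduce the averages reported above.

```python
# EIIP values per amino acid (assumption: Tyr = 0.0516, implied by the score tables).
EIIP = {
    "L": 0.0,    "I": 0.0,    "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}

# The two hemoglobin sequences quoted in the presentation, in the order they appear.
SEQ_1 = ("VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF"
         "SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH")
SEQ_2 = ("VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAV"
         "AHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR")

def lindley(seq, offset=0.0532):
    """W_n = max(W_{n-1} + S_n, 0) with S_n = EIIP - offset and W_0 = 0 (assumed)."""
    w, out = 0.0, []
    for aa in seq:
        w = max(w + EIIP[aa] - offset, 0.0)
        out.append(w)
    return out

def mean_abs_difference(seq_a, seq_b, offset=0.0532):
    """Average |W_n(a) - W_n(b)| over the common length (zip truncates the longer one)."""
    diffs = [abs(x - y) for x, y in zip(lindley(seq_a, offset), lindley(seq_b, offset))]
    return sum(diffs) / len(diffs)

print(mean_abs_difference(SEQ_1, SEQ_2))
```

Each sequence is scanned once and no alignment is computed, which is where the shorter processing time comes from in this kind of comparison.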

Page 21: Genomic Sequence Analysis using Electron-Ion Interaction Potential

Conclusion

We were able to find gene regions in a DNA sequence by using the Lindley equation and EIIP.

We were able to show a similarity comparison technique that shortens the processing time by using the Lindley equation and EIIP.