Poisson Approximation and Viral Replication

Ming-Ying Leung
Department of Mathematical Sciences, UTEP

Outline:
• Law of small numbers
• Chen-Stein method of Poisson approximation
• DNA and viral replication origins
• Computational prediction of replication origins
• Role of Poisson process approximation

Post on 12-Jan-2022


Law of Small Numbers (Poisson, 1837)

Let $X_1, X_2, \ldots, X_n$ be i.i.d. Bernoulli random variables with success probability $p$, and let $W = \sum_{i=1}^{n} X_i$.

If $n \to \infty$ and $p \to 0$ in such a way that $np = \lambda > 0$, then for $k = 0, 1, 2, \ldots$

$$P(W = k) = \binom{n}{k} p^k (1-p)^{n-k} \;\to\; e^{-\lambda} \frac{\lambda^k}{k!} = P(Y = k),$$

where $Y$ is a Poisson random variable with parameter $\lambda$.
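The convergence in the law of small numbers can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the function names `binomial_pmf`, `poisson_pmf` and `max_gap`, the choice λ = 2, and the range of k are all illustrative assumptions.

```python
import math

def binomial_pmf(n, p, k):
    """P(W = k) for W ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_gap(n, lam, kmax=10):
    """Largest pointwise gap between the two pmfs over k = 0..kmax, with p = lam / n."""
    p = lam / n
    return max(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k))
               for k in range(kmax + 1))

if __name__ == "__main__":
    lam = 2.0  # illustrative value, held fixed while n grows
    for n in (10, 100, 1000, 10000):
        print(n, max_gap(n, lam))
```

With λ held fixed, the printed gap should shrink as n grows, which is the statement of the theorem restricted to a finite range of k.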

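The Chen-Stein Method (Chen, 1970)

Let $I$ be an index set. For any $\alpha \in I$, $X_\alpha$ is a Bernoulli random variable with success probability $p_\alpha$, and $B(\alpha)$ is a subset of $I$ containing $\alpha$, called the neighborhood of dependence of $\alpha$. Let

$$W = \sum_{\alpha \in I} X_\alpha$$

and let $Y$ be a Poisson random variable with parameter

$$\lambda = \sum_{\alpha \in I} p_\alpha .$$

Define

$$b_1 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha)} p_\alpha p_\beta ,$$

$$b_2 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha) \setminus \{\alpha\}} p_{\alpha\beta}, \quad \text{where } p_{\alpha\beta} = E[X_\alpha X_\beta],$$

$$b_3 = \sum_{\alpha \in I} s_\alpha, \quad \text{where } s_\alpha = E\,\bigl| E[\,X_\alpha - p_\alpha \mid X_\gamma,\ \gamma \notin B(\alpha)\,] \bigr| ,$$

then

$$d_{TV}(W, Y) \le (b_1 + b_2)\,\frac{1 - e^{-\lambda}}{\lambda} + b_3 \min\!\left(1, \frac{1}{\sqrt{\lambda}}\right).$$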
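As a sanity check on the Chen-Stein bound, consider the simplest case, which is an assumption made here for illustration and not a case worked on the slides: the indicators are independent and B(α) = {α}. Then b2 = 0 (the inner sum is empty) and b3 = 0 (conditioning on independent variables changes nothing), so only the b1 term survives. The Python sketch below compares that bound with the exact total variation distance for i.i.d. indicators; the function names are hypothetical.

```python
import math

def independent_case_bound(probs):
    """Chen-Stein bound when the X_alpha are independent and B(alpha) = {alpha}.

    Assumes b2 = b3 = 0, so the bound reduces to b1 * (1 - exp(-lam)) / lam
    with b1 = sum of p_alpha^2 and lam = sum of p_alpha.
    """
    lam = sum(probs)
    b1 = sum(p * p for p in probs)
    return b1 * (1 - math.exp(-lam)) / lam

def tv_binomial_poisson(n, p):
    """Exact d_TV between Binomial(n, p) and Poisson(n * p), as half the L1 distance."""
    lam = n * p
    total = 0.0
    poisson_mass = 0.0
    pois = math.exp(-lam)  # P(Y = 0); updated iteratively below
    for k in range(n + 1):
        binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
        total += abs(binom - pois)
        poisson_mass += pois
        pois *= lam / (k + 1)
    # The binomial puts no mass above n, so the remaining Poisson tail counts fully.
    total += max(0.0, 1.0 - poisson_mass)
    return 0.5 * total

if __name__ == "__main__":
    n, p = 100, 0.02  # illustrative values
    print("exact d_TV :", tv_binomial_poisson(n, p))
    print("bound      :", independent_case_bound([p] * n))
```

In this special case the bound simplifies to p(1 − e^{−λ}), so it is small whenever the individual success probabilities are small, regardless of n.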

DNA

• DNA is deoxyribonucleic acid, made up of 4 nucleotide bases:
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)

• The bases A and T form a complementary pair, as do C and G.


Genes and Genome

DNA Replication

Virus and Eye Diseases

CMV Particle

CMV Retinitis
• inflammation of the retina
• triggered by CMV particles
• may lead to blindness

Genome size ~ 230 kbp

Replication Origins and Palindromes

• High concentration of palindromes exists around replication origins of other herpesviruses

• Locating clusters of palindromes (above a minimal length) on CMV genome sequence might reveal likely locations of its replication origins.

Palindromes in Letter Sequences

Odd Palindrome: “A nut for a jar of tuna”

(remove spaces and capitalize)

ANUTFORA J AROFTUNA

Even Palindrome: “Step on no pets”

STEPON NOPETS

DNA Palindrome: A string of nucleotide bases that reads the same as its reverse complement. A DNA palindrome must be even in length, e.g., palindrome of length 10:

5’ ….. GCAATATTGC ….. 3’
3’ ….. CGTTATAACG ….. 5’

Position: $j-L+1, \;\ldots,\; j, \; j+1, \;\ldots,\; j+L$
Base: $b_1, \; b_2, \;\ldots,\; b_L, \; b_{L+1}, \;\ldots,\; b_{2L-1}, \; b_{2L}$

We say that a palindrome of length $2L$ occurs at position $j$ when the $(j-i+1)$st and the $(j+i)$th bases are complementary to each other for $i = 1, \ldots, L$. In an i.i.d. sequence model this occurs with probability

$$\bigl[\,2\,(p_A p_T + p_C p_G)\,\bigr]^{L}.$$
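The definition of a palindrome at position j, and its probability under the i.i.d. model, can be sketched in Python as follows. This is an illustrative sketch only: the names `COMPLEMENT`, `is_palindrome_at`, `palindrome_positions` and `palindrome_probability` are assumptions, and positions are 1-based to match the notation above.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindrome_at(seq, j, L):
    """True if a palindrome of length 2L occurs at 1-based position j.

    Checks that bases j-i+1 and j+i are complementary for i = 1..L.
    """
    if j - L < 0 or j + L > len(seq):
        return False  # the window j-L+1 .. j+L does not fit in the sequence
    for i in range(1, L + 1):
        left = seq[j - i]        # 1-based position j-i+1
        right = seq[j + i - 1]   # 1-based position j+i
        if COMPLEMENT.get(left) != right:
            return False
    return True

def palindrome_positions(seq, L):
    """All positions j at which a palindrome of length 2L occurs."""
    return [j for j in range(L, len(seq) - L + 1) if is_palindrome_at(seq, j, L)]

def palindrome_probability(p_a, p_c, p_g, p_t, L):
    """Probability [2 (pA pT + pC pG)]^L under the i.i.d. sequence model."""
    return (2 * (p_a * p_t + p_c * p_g)) ** L

if __name__ == "__main__":
    print(palindrome_positions("GCAATATTGC", 5))
    print(palindrome_probability(0.25, 0.25, 0.25, 0.25, 5))
```

For the length-10 example GCAATATTGC the palindrome is found at j = 5, the midpoint of the string. With equal base frequencies each pair is complementary with probability 1/4, so a palindrome of length 10 occurs at a given position with probability (1/4)^5.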

Association of Palindrome Clusters with Replication Origins

Computational Prediction of Replication Origins

• Palindrome distribution in a random sequence model

• Criterion for identifying statistically significant palindrome clusters

• Evaluate prediction accuracy

• Try to improve…

The Scan Statistic

$X_1, X_2, \ldots, X_n \sim$ i.i.d. Uniform(0,1)

$S_i = X_{(i+1)} - X_{(i)}$ = $i$th spacing

$A_r(i) = S_i + \cdots + S_{i+r-1}$ = sum of $r$ adjoining spacings

r-Scan Statistic: $A_r = \min_i A_r(i)$
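The r-scan statistic is straightforward to compute once the points are sorted, because the sum of r adjoining spacings telescopes to a difference of two order statistics. The Python sketch below is illustrative only; the function name `r_scan_statistic` and the sample points are assumptions.

```python
def r_scan_statistic(points, r):
    """Minimum over i of A_r(i), the sum of r adjoining spacings.

    Uses A_r(i) = S_i + ... + S_{i+r-1} = X_(i+r) - X_(i), where X_(i)
    denotes the i-th smallest point.
    """
    xs = sorted(points)
    if r < 1 or r > len(xs) - 1:
        raise ValueError("r must be between 1 and the number of spacings")
    return min(xs[i + r] - xs[i] for i in range(len(xs) - r))

if __name__ == "__main__":
    # Hypothetical palindrome positions, rescaled to the unit interval.
    positions = [0.10, 0.50, 0.55, 0.60, 0.90]
    print(r_scan_statistic(positions, 1))  # smallest single spacing
    print(r_scan_statistic(positions, 2))  # tightest run of 3 consecutive points
```

A small value of the statistic means that r + 1 consecutive points fall unusually close together, which is the kind of cluster the method looks for.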

Poisson Process Approximation of Palindrome Distribution

The d1 distance

Consider the real-valued function $f : K \to \mathbb{R}$, where

$$K = \left\{ \sum_{i=1}^{n} \delta_{x_i} : n \ge 0,\ x_i \in [0,1] \right\}$$

is called the configuration space of $[0,1]$.

Let $d_1(\xi_1, \xi_2)$ denote the distance between two configurations $\xi_1 = (y_{11}, \ldots, y_{1m})$ and $\xi_2 = (y_{21}, \ldots, y_{2n})$ such that

$$d_1(\xi_1, \xi_2) = \begin{cases} 1 & \text{if } m \ne n, \\[4pt] \displaystyle \min_{\pi} \left\{ \frac{1}{n} \sum_{i=1}^{n} \bigl| y_{1i} - y_{2\pi(i)} \bigr| \right\} & \text{if } m = n, \end{cases}$$

where the minimum is taken over all permutations $\pi$ of $(1, 2, \ldots, n)$.
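The d1 distance between two configurations can be sketched in Python as below. The sketch relies on the standard fact that, for points on a line with absolute-difference cost, pairing the points in sorted order attains the minimum over permutations; a brute-force version is included as a cross-check. The function names are hypothetical, and returning 0 for two empty configurations is an assumption not stated on the slide.

```python
from itertools import permutations

def d1(config1, config2):
    """d1 distance between two finite configurations of points in [0, 1]."""
    m, n = len(config1), len(config2)
    if m != n:
        return 1.0
    if n == 0:
        return 0.0  # assumption: two empty configurations are at distance 0
    a, b = sorted(config1), sorted(config2)
    # Sorted pairing attains the minimum over all permutations in one dimension.
    return sum(abs(x - y) for x, y in zip(a, b)) / n

def d1_brute_force(config1, config2):
    """Same distance computed directly from the definition (small n only)."""
    m, n = len(config1), len(config2)
    if m != n:
        return 1.0
    if n == 0:
        return 0.0  # same assumption as in d1
    return min(
        sum(abs(config1[i] - config2[pi[i]]) for i in range(n))
        for pi in permutations(range(n))
    ) / n

if __name__ == "__main__":
    xi1 = [0.1, 0.9, 0.4]
    xi2 = [0.8, 0.2, 0.5]
    print(d1(xi1, xi2), d1_brute_force(xi1, xi2))
```

The brute-force version enumerates n! permutations, so it is only useful for checking the sorted version on a handful of points.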

The Wasserstein or d2 distance

For two random processes $X$ and $Y$, the Wasserstein distance is

$$d_2(X, Y) = \sup \bigl\{ \, | E f(X) - E f(Y) | : \|f\|_\infty \le 1,\ \|f\|_{Lip(K)} \le 1 \, \bigr\},$$

where

$$\|f\|_\infty = \sup_{\xi \in K} |f(\xi)|$$

and

$$\|f\|_{Lip(K)} = \sup \left\{ \frac{|f(\xi_1) - f(\xi_2)|}{d_1(\xi_1, \xi_2)} : \xi_1 \ne \xi_2 \in K \right\}.$$

Measures of Prediction Accuracy

Attempts to improve prediction accuracy by:
• Adopting the best possible approximation to the scan statistic distribution
• Taking the lengths of palindromes into consideration when counting palindromes
• Using a better random sequence model

Related Work in Progress

• Finding the palindrome distribution on Markov random sequences

• Investigating other sequence patterns such as close repeats and inversions in relation to replication origins

Other Mathematical Topics in Bioinformatics

• Optimization Techniques – prediction of molecular structures

• Differential Equations – molecular dynamics

• Matrix Theory – analyzing gene expression data

• Fourier Analysis – proteomics data

Acknowledgements

Collaborators
• Louis H. Y. Chen (National University of Singapore)
• David Chew (National University of Singapore)
• Kwok Pui Choi (National University of Singapore)
• Aihua Xia (University of Melbourne, Australia)

Funding Support
• NIH Grants S06GM08194-23, S06GM08194-24, and 2G12RR008124
• NSF DUE9981104
• W.M. Keck Center of Computational & Struct. Biol. at Rice University
• National Univ. of Singapore ARF Research Grant (R-146-000-013-112)
• Singapore BMRC Grants 01/21/19/140 and 01/1/21/19/217