Poisson Approximation and Viral Replication
Ming-Ying Leung
Department of Mathematical Sciences, UTEP
Outline:
• Law of small numbers
• Chen-Stein method of Poisson approximation
• DNA and viral replication origins
• Computational prediction of replication origins
• Role of Poisson process approximation
Law of Small Numbers (Poisson, 1837)
Let $X_1, X_2, \dots, X_n$ be i.i.d. Bernoulli random variables with success probability $p$ and let $W = \sum_{i=1}^{n} X_i$.
If $n \to \infty$ and $p \to 0$ in such a way that $np = \lambda > 0$, then for $k = 0, 1, 2, \dots$
$$P(W = k) = \binom{n}{k} p^k (1-p)^{n-k} \to e^{-\lambda} \frac{\lambda^k}{k!} = P(Y_\lambda = k),$$
where $Y_\lambda$ is a Poisson random variable with parameter $\lambda$.
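The convergence can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the choice $\lambda = 2$ are illustrative assumptions. It holds $np = \lambda$ fixed and reports the largest gap between the binomial and Poisson probabilities over $k = 0, \dots, 10$ as $n$ grows.

```python
from math import comb, exp, factorial


def binomial_pmf(n: int, p: float, k: int) -> float:
    """P(W = k) for W ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(lam: float, k: int) -> float:
    """P(Y = k) for Y ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def max_pmf_gap(n: int, lam: float, k_max: int = 10) -> float:
    """Largest |P(W = k) - P(Y = k)| over k = 0..k_max, with p = lam / n."""
    p = lam / n
    return max(
        abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k))
        for k in range(k_max + 1)
    )


if __name__ == "__main__":
    lam = 2.0  # illustrative value, not taken from the slides
    for n in (10, 100, 1000, 10000):
        print(n, max_pmf_gap(n, lam))
```

The printed gap should shrink as $n$ increases, which is the statement of the law of small numbers for this particular $\lambda$.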
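The Chen-Stein Method (Chen, 1970)
Let $I$ be an index set. For any $\alpha \in I$, $X_\alpha$ is a Bernoulli random variable with success probability $p_\alpha$ and $B(\alpha)$ is a subset of $I$ containing $\alpha$, called the neighborhood of dependence of $\alpha$. Let $W = \sum_{\alpha \in I} X_\alpha$ and $Y_\lambda$ be a Poisson random variable with parameter $\lambda = \sum_{\alpha \in I} p_\alpha$.
Define
$$b_1 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha)} p_\alpha p_\beta$$
$$b_2 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha) \setminus \{\alpha\}} p_{\alpha\beta}, \quad \text{where } p_{\alpha\beta} = E[X_\alpha X_\beta]$$
$$b_3 = \sum_{\alpha \in I} s_\alpha, \quad \text{where } s_\alpha = E\left| E[X_\alpha - p_\alpha \mid X_\gamma, \gamma \notin B(\alpha)] \right|$$
then
$$d_{TV}(W, Y_\lambda) \le (b_1 + b_2) \frac{1 - e^{-\lambda}}{\lambda} + b_3 \min\left(1, \sqrt{\frac{2}{e\lambda}}\right)$$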
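A simple special case helps to see how the quantities are used. The sketch below is an illustration under stated assumptions, not the computation from the talk: it takes independent indicators with $B(\alpha) = \{\alpha\}$, so that $b_2 = 0$ and $b_3 = 0$ and only the term $(b_1 + b_2)(1 - e^{-\lambda})/\lambda$ of the Chen-Stein bound remains. It then compares that bound with the exact total variation distance between a binomial and a Poisson distribution. The function names and the values $n = 50$, $p = 0.04$ are hypothetical.

```python
from math import comb, exp


def chen_stein_bound_independent(probs):
    """Chen-Stein bound for independent indicators, assuming B(alpha) = {alpha}.

    Under this assumption b2 = 0 (B(alpha) minus {alpha} is empty) and b3 = 0
    (X_alpha is independent of everything outside B(alpha)), so only the
    (b1 + b2) * (1 - exp(-lam)) / lam term contributes.
    """
    lam = sum(probs)
    b1 = sum(p * p for p in probs)  # beta ranges over B(alpha) = {alpha} only
    b2 = 0.0
    return (b1 + b2) * (1 - exp(-lam)) / lam


def exact_tv_binomial_poisson(n, p, extra_terms=50):
    """d_TV between Binomial(n, p) and Poisson(n * p), as half the L1 distance.

    The Poisson tail beyond n + extra_terms is ignored; it is negligible
    for the small lam used here.
    """
    lam = n * p
    pois = exp(-lam)  # P(Y = 0)
    total = 0.0
    for k in range(n + extra_terms + 1):
        binom = comb(n, k) * p**k * (1 - p) ** (n - k) if k <= n else 0.0
        total += abs(binom - pois)
        pois *= lam / (k + 1)  # P(Y = k + 1) from P(Y = k)
    return 0.5 * total


if __name__ == "__main__":
    n, p = 50, 0.04  # illustrative values
    print("exact d_TV:", exact_tv_binomial_poisson(n, p))
    print("bound     :", chen_stein_bound_independent([p] * n))
```

For dependent indicators, such as palindrome occurrences at nearby positions, $b_2$ and possibly $b_3$ no longer vanish and have to be evaluated for the chosen neighborhoods.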
DNA
• DNA is deoxyribonucleic acid, made up of 4 nucleotide bases
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)
• The bases A and T form a complementary pair, as do C and G.
Virus and Eye Diseases
CMV Particle
• Genome size ~ 230 kbp
CMV Retinitis
• inflammation of the retina
• triggered by CMV particles
• may lead to blindness
Replication Origins and Palindromes
• High concentrations of palindromes exist around the replication origins of other herpesviruses.
• Locating clusters of palindromes (above a minimal length) on the CMV genome sequence might reveal likely locations of its replication origins.
Palindromes in Letter Sequences
Odd Palindrome: “A nut for a jar of tuna”
remove spaces and capitalize:
ANUTFORA J AROFTUNA
Even Palindrome: “Step on no pets”
STEPON NOPETS
DNA Palindrome: A string of nucleotide bases that reads the same as its reverse complement. A DNA palindrome must be even in length, e.g., palindrome of length 10:
5’ ….. GCAATATTGC ….. 3’
3’ ….. CGTTATAACG ….. 5’
Positions $j-L+1, \dots, j, j+1, \dots, j+L$ hold the bases $b_1, b_2, \dots, b_L, b_{L+1}, \dots, b_{2L-1}, b_{2L}$.
We say that a palindrome of length $2L$ occurs at position $j$ when the $(j-i+1)$st and the $(j+i)$th bases are complementary to each other for $i = 1, \dots, L$. In an i.i.d. sequence model this occurs with probability $[2(p_A p_T + p_C p_G)]^L$, where $p_A, p_C, p_G, p_T$ are the base probabilities.
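The definition translates directly into a position-by-position check. The sketch below is illustrative only: the function names are hypothetical, positions are 1-based as on the slide, and any character other than A, C, G, T is treated as matching nothing.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindrome_at(seq: str, j: int, half_len: int) -> bool:
    """True if a palindrome of length 2 * half_len occurs at 1-based position j.

    That is, bases at positions j - i + 1 and j + i are complementary
    for i = 1, ..., half_len.
    """
    if j - half_len < 0 or j + half_len > len(seq):
        return False  # the palindrome would run off the sequence
    for i in range(1, half_len + 1):
        left = seq[j - i]       # 1-based position j - i + 1
        right = seq[j + i - 1]  # 1-based position j + i
        if COMPLEMENT.get(left) != right:
            return False
    return True


def palindrome_positions(seq: str, half_len: int) -> list:
    """All 1-based positions j at which a palindrome of length 2 * half_len occurs."""
    return [
        j
        for j in range(half_len, len(seq) - half_len + 1)
        if is_palindrome_at(seq, j, half_len)
    ]


def palindrome_probability(p_a, p_c, p_g, p_t, half_len):
    """[2 (pA pT + pC pG)]^L under the i.i.d. sequence model."""
    return (2 * (p_a * p_t + p_c * p_g)) ** half_len


if __name__ == "__main__":
    print(palindrome_positions("GCAATATTGC", 5))
    print(palindrome_probability(0.25, 0.25, 0.25, 0.25, 5))
```

With equal base probabilities each pair is complementary with probability 1/4, so a palindrome of length 10 occurs at a given position with probability $(1/4)^5$.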
Computational Prediction of Replication Origins
• Palindrome distribution in a random sequence model
• Criterion for identifying statistically significant palindrome clusters
• Evaluate prediction accuracy
• Try to improve…
The Scan Statistic
$X_1, X_2, \dots, X_n \sim$ i.i.d. Uniform(0,1)
$S_i = X_{(i+1)} - X_{(i)}$ = $i$th spacing
$A_r(i) = S_i + \dots + S_{i+r-1}$ = sum of $r$ adjoining spacings
$r$-Scan Statistic: $A_r = \min_i A_r(i)$
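The $r$-scan statistic is straightforward to compute from sorted points, because the sum of $r$ adjoining spacings telescopes to $X_{(i+r)} - X_{(i)}$. The sketch below is a minimal illustration with a hypothetical function name. It assumes the input points (for example, palindrome positions divided by the genome length) already lie in $(0,1)$.

```python
def r_scan_statistic(points, r):
    """Minimum over i of A_r(i), the sum of r adjoining spacings.

    Since S_i + ... + S_{i+r-1} = X_(i+r) - X_(i), the statistic is the
    smallest distance spanned by r + 1 consecutive order statistics.
    """
    xs = sorted(points)
    if r < 1 or r >= len(xs):
        raise ValueError("need 1 <= r < number of points")
    return min(xs[i + r] - xs[i] for i in range(len(xs) - r))


if __name__ == "__main__":
    pts = [0.9, 0.1, 0.55, 0.5, 0.6]  # illustrative values
    print(r_scan_statistic(pts, 1))  # smallest single spacing
    print(r_scan_statistic(pts, 2))  # tightest group of three points
```

An unusually small value of $A_r$ indicates that some $r + 1$ points are packed more tightly than the uniform model would suggest, which is how a cluster is flagged.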
The d1 distance
Consider the real valued function $f: K \to \mathbb{R}$ where
$$K = \left\{ \sum_{i=1}^{n} \delta_{x_i} : n \ge 0,\ x_i \in [0,1] \right\}$$
is called the configuration space of $[0,1]$.
Let $d_1(\xi_1, \xi_2)$ denote the distance between two configurations $\xi_1 = (y_{11}, \dots, y_{1m})$ and $\xi_2 = (y_{21}, \dots, y_{2n})$ such that
$$d_1(\xi_1, \xi_2) = \begin{cases} 1 & \text{if } m \ne n \\ \min_{\pi} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left| y_{1i} - y_{2\pi(i)} \right| \right\} & \text{if } m = n \end{cases}$$
where the minimum is taken over all permutations $\pi$ of $(1, 2, \dots, n)$.
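For small configurations the $d_1$ distance can be evaluated directly from the definition by trying every permutation. The sketch below is illustrative rather than efficient, and the function name is hypothetical. Returning 0 for two empty configurations is an assumption, since the definition above does not spell out that case.

```python
from itertools import permutations


def d1_distance(config1, config2):
    """d1 distance between two finite configurations of points in [0, 1].

    Returns 1 when the configurations have different sizes; otherwise the
    minimum over permutations of the average matched distance.
    """
    m, n = len(config1), len(config2)
    if m != n:
        return 1.0
    if n == 0:
        return 0.0  # assumption: two empty configurations are identical
    return min(
        sum(abs(a - config2[pi]) for a, pi in zip(config1, perm)) / n
        for perm in permutations(range(n))
    )


if __name__ == "__main__":
    print(d1_distance([0.1, 0.9], [0.8, 0.2]))
    print(d1_distance([0.1], [0.1, 0.2]))
```

The brute-force search costs $n!$ evaluations, so it is only practical for a handful of points. It is meant to make the definition concrete, not to serve as a production routine.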
The Wasserstein or d2 distance
For two random processes $X$ and $Y$, the Wasserstein distance is
$$d_2(X, Y) = \sup\left\{ \left| E f(X) - E f(Y) \right| : \|f\|_\infty \le 1,\ \|f\|_{Lip(K)} \le 1 \right\}$$
where
$$\|f\|_\infty = \sup_{\xi \in K} |f(\xi)|$$
and
$$\|f\|_{Lip(K)} = \sup\left\{ \frac{|f(\xi_1) - f(\xi_2)|}{d_1(\xi_1, \xi_2)} : \xi_1 \ne \xi_2 \in K \right\}$$
Measures of Prediction Accuracy
Attempts to improve prediction accuracy by:
• Adopting the best possible approximation to the scan statistic distribution
• Taking the lengths of palindromes into consideration when counting palindromes
• Using a better random sequence model
Related Work in Progress
• Finding the palindrome distribution on Markov random sequences
• Investigating other sequence patterns such as close repeats and inversions in relation to replication origins
Other Mathematical Topics in Bioinformatics
• Optimization Techniques – prediction of molecular structures
• Differential Equations – molecular dynamics
• Matrix Theory – analyzing gene expression data
• Fourier Analysis – proteomics data
Acknowledgements
Collaborators
Louis H. Y. Chen (National University of Singapore)
David Chew (National University of Singapore)
Kwok Pui Choi (National University of Singapore)
Aihua Xia (University of Melbourne, Australia)
Funding Support
NIH Grants S06GM08194-23, S06GM08194-24, and 2G12RR008124
NSF DUE9981104
W.M. Keck Center of Computational & Struct. Biol. at Rice University
National Univ. of Singapore ARF Research Grant (R-146-000-013-112)
Singapore BMRC Grants 01/21/19/140 and 01/1/21/19/217