decomposing statistical periodicitiesraman/publications_files/arora.ssp.07.pdf · locating hidden...

DECOMPOSING STATISTICAL PERIODICITIES

Raman Arora and William Sethares

Department of Electrical and Computer Engineering, University of Wisconsin-Madison,Madison, WI 53706-1691 USA. [email protected], [email protected]

ABSTRACTNonstationary random symbolic sequences are investigatedfor cyclostationarity and decomposition into constituent cy-clostationary sources. Since the symbolic sources do notadmit an algebraic structure, a composition of distribu-tions of cyclostationary sources is studied that models theerosion in symbolic sequences, for instance, mutations ingene sequences. This composition gives a rich mathematicalstructure on the collection of cyclostationary sources anda uniqueness theorem for decomposition of pq-statisticallyperiodic symbolic sources into cyclostationary sources withperiods p and q, when p and q are coprime.

Index Terms— Cyclostationarity, symbolic periodicity,symbolic time series, genomic signal processing.

I. INTRODUCTION

SYMBOLIC sequences are time series defined on afinite set with no algebra. In DNA sequences, economic

indicator data, and other nominal time series, the onlymathematical structure is set membership [1]. An interestingand important behaviour symbolic sequences may exhibitis periodicity and finding these periodicities is fundamentalto the understanding and determination of the structure ofthe sequences. In genomic signal processing, for instance,locating hidden periodicities in DNA sequences is a veryimportant task [2]. Repetitions in DNA have been shownto be correlated with several structural and functional roles.For example, base (symbol) periodicity of 21 is associatedwith α-helical formation for synthesized protein molecules[2] and base periodicity of 3 is identified with protein codingregion of the DNA. Such investigations also find applicationsin diagnosis of genetic disorders (Huntington’s disease [3]),DNA forensics, and reconstructing evolutionary history [4].Symbolic periodicity in DNA sequences comes in several

flavors: homologous, eroded, and latent [5]. Homologousperiodicities occur when short fragments of DNA are re-peated in tandem to give periodic sequences. Imperfect oreroded periodicities [6] result when some of the bases in thehomological periodic sequence get replaced or altered so thatthe tandem repeats are no longer perfect. Latent periodicities[7], [8] occur when the repeating unit is not a fixed sequencebut may change in a patterned way (for instance, a DNA

sequence in which the nth element is always either A or G)or when there are indels (insertions and deletions) in homo-logical periodic sequences. This taxonomy of periodicitiesis not specific to DNA sequences; it applies to any symbolicsequence.Most current approaches to detecting periodicities trans-

form the symbolic sequences into a numerical sequence[8], [9], [10]; these techniques are primarily aimed at thedetection of homological periodicities [4], [11], [12]. Ageneral approach to the detection of these three classesof periodicities was presented in [13] using a maximumlikelihood formulation. In this approach, each element inthe DNA sequence is assumed to be generated from aninformation source with an underlying probability massfunction (pmf). The number of sources defines the periodand the symbols are drawn from these sources in a cyclicmanner. Thus, periodicities in the symbols are representedby repetitions of the pmfs. This can be pictured as in Fig. 1.A rotating carousel (labeled A) contains NA urns, each withits own distribution of balls (which are labeled A, G, C, orT ). At each timestep, a ball is drawn from the urn and thecarousel rotates one position. The output of the process isnot periodic; rather, the distribution from which the symbolsare chosen is periodic. This is called statistical periodicityor strict sense cyclostationarity [14].

AB

Fig. 1. Each time a ball is removed from one of the NA

urns (indicated by the arrow), platform A rotates, bringinga new urn into position. Similarly, carousel B contains NB

urns, each with its own collection of balls. Draws are madeby combining draws from the two aligned urns and resultsin a NANB statistical periodicity.

1951-4244-1198-X/07/$25.00 ©2007 IEEE SSP 2007

Symbolic random variables take values on a finite setcalled the alphabet whose elements are called symbols. Itis possible to map the symbols to numbers or to define analgebra on the alphabet, but this would impose mathematicalstructure that was not present to begin with. For instance, themapping of DNA elements (T = 0, C = 1, A = 2, G = 3),suggested in [15], puts an order on the set; the complex rep-resentation (A = 1+ j,G = −1+ j, C = −1− j, T = 1− j)used in [12], [8] implies that the euclidean distance betweenA and C is greater than the distance between A and T [16].A good survey of various numerical representations for DNAsequences is presented in [17]. Artifacts of such mappingsare reported in [11].In contrast, the formulation in this paper implies no

mathematical structure on the alphabet. The focus is onan investigation of multiple periodicities: ways that severalperiodic symbolic sequences may combine to generate newsequences. In DNA sequences, multiple periodicities havebeen observed within coding regions [6]. For example, latentperiodicities of 120 bases and 126 bases were reported invarious genes in [2]. Such longer periods that are multiplesof 3 occur in coding regions which have a characteristicperiod of three. As noted by Korotkov et. al [6], such peri-odicities can be related to evolutionary origins via multipleduplications.Multiple periodicities in symbolic random sequences are

investigated by defining compositions on the probabilitymeasures associated with the sequences. One possibility isto form a Bernoulli mixture of two symbolic sequences; atevery instant of time picking symbol from a sequence withprobability β and from the other sequence with probability1−β. If pt and qt denote the distributions over the alphabetfor the two sequences at instant t, the distribution for thecomposed sequence is given as βpt + (1 − β)qt. If thedistributions pt and qt exhibit periodicities, the Bernoullimixture may exhibit multiple periodicities. The composi-tion of distributions arises naturally from the underlyingexperiment, in this case the Bernoulli mixture, and its easilyverified that it does not imply an algebraic structure on thealphabet.This paper focuses on a (different) method of composi-

tion that can be interpreted as an erosion or mutation ingenes. The corresponding physical experiment is illustratedin Fig. 1, which contains two rotating carousels A andB with NA and NB urns respectively. At each timestep,the two carousels rotate into position and an element isdrawn from each of the two aligned urns (indicated by thebrackets). If the two drawn elements have the same label,the output assumes that label. If the draws give balls withdifferent labels, they are returned to the urns. This continuesuntil an identical pair is drawn. The urns then rotate andthe process repeats. The motivation for this model comesfrom the DNA replication model as discussed in Sect. II.Sect. III shows that this method of composition gives a

rich mathematical structure in which to study statisticalperiodicities with multiple hidden periodicities. The figureshows how two cyclostationary sequences with periods NA

and NB may combine to form a new sequence with periodNANB .Sect. IV investigates the inverse problem: given a cyclo-

stationary symbolic sequence, how can it be decomposedinto constituent cyclostationary subsequences. These inves-tigations provide a structured way of attacking the problemof locating such hidden periodicities. While the DNA se-quencing application provides motivation for this work, theunderlying mathematics is general enough to easily includeany symbolic set with any (finite) number of elements.

II. DNA REPLICATION AND MUTATION

DNA Polymerase

Helicase

Lagging

strand

Leadingstrand

Fig. 2. DNA replication: The two strands of the DNA forkand the polymerase recreates the DNA by attaching thecomplementary bases to each separated strand thus creatingtwo copies of the original DNA sequence [18].

DNA is a long polymer composed of four repeating basesor nucleotides: A (adenine), G (guanine), T (thymine),and C (cytosine). It exists as a tightly entwined pair ofstrands in the shape of a helix. The two strands are heldtogether by hydrogen bonds between complementary bases:the nucleotide A in one strand bonding only with T of theother and G bonding only with C. The changes in the base-pair sequences of DNA are mutations which may occur whenthe DNA is being replicated.DNA replication begins with helical unwinding. The two

strands are pulled apart like a zipper as shown in Fig. 2,resulting into two separate strands. The DNA sequence ofthe forked strands is recreated by the enzyme polymerasein accordance with rules of complementary base pairing.However, substitution errors may occur, and the error rateis approximately 10−3 to 10−5 [19]. The overall mutation

196

rate is reported to be 10−10, due largely to the inbuilt proof-reading mechanism of the replication process explained next.A substitution error in the replication process causes a

kink in the DNA sequence due to an imbalance of thesizes of the purines (A, G) and the pyrimidines (C, T ).If a mismatch is detected, the replication stops till thepolymerase restores the correct nucleotide [16]. This DNAevolution (or mutation) process provides the motivation forthe two-carousel-model above. The original DNA sequencecan be modeled as a sequence of non-stationary randomvariables. Often these sequences exhibit cyclostationarity,for instance the 3-base-pair periodicity is characteristic ofthe exons. The complementary nucleotides generated by thepolymerase (when recreating the forked strands) are modeledas a second sequence of nonstationary random variables.The two sequences may have the same distributions inthe absence of the mutations. If there are mutations, thedistribution of the secondary sequence is affected and theindels may cause the secondary sequence to exhibit differentstatistical periodicity. led as a second sequence of nonsta-tionary random variables. The two sequences may have thesame distributions in the absence of the mutations. If thereare mutations, the distribution of the secondary sequence isaffected and the indels may cause the secondary sequenceto exhibit different statistical periodicity.Finally, the composition of pmfs of the two sequences

captures the proof-reading mechanism in the DNA replica-tion which progresses only if bases drawn from the twosources are complementary. The two carousel model ofFig. 1 provides a simplified analogy: the former defines anevent as identical balls drawn from the two urns, the latterdefines an event as complementary base pairs attached to twostrands. The analogy is strengthened since each nucleotideuniquely determines the complementary base.

III. PERIODIC SUBSPACESLet A = {a1, . . . , aM} be a finite set with cardinality

M . Let X be an A-valued random variable with probabilitymass function (pmf) μX , i.e. for a ∈ A, μX(a) denotes theprobability P (X = a). Let X denote the collection of allrandom variables on the alphabet A. For n, p ∈ Z

+ let n̂p

denote the positive integer 1 + ((n− 1) mod p).Define a symbolic (random) sequence taking values on

the set A to be a sequence of independent random variablesS : Z

+ → X. The symbolic sequence S is said to be p-statistically periodic if p is a positive integer such that therandom variables Sn and Sn̂p

are identically distributed forall n ∈ Z

+. Let Pp be the collection of p−statisticallyperiodic sequences. Then P =

⋃p∈Z+ Pp is the set of all

statistically periodic sequences of random variables on thealphabet A. A sequence S in Pp can also be described byan M × p column-stochastic matrix QS whose ith column,denoted qS

i , gives the pmf of Snp+i for all n ∈ Z+, i.e.

P (Snp+i = aj) = P (Si = aj) = QSji ≡ qS

i (j) (1)

where j = 1, · · · ,M . Therefore Pp can also be identified asthe set of allM×p column stochastic matrices. With a slightabuse of notation, the elements of Pp may be referred to as arandom symbolic sequence X or as the corresponding pmfQX . The law of composition on the pmfs of the randomsymbolic sequences that captures the experiment in Fig. 1defines

⊕ : P × P → P(X,Y ) �→ Z

(2)

on P as follows. Let X,Y ∈ P be sequences with statisticalperiodicities p and q respectively. Then Z = X ⊕ Y is thesequence of random variables such that for all a ∈ A

P

(Zn = a

)= P

(Xn̂p

= a, Yn̂q= a

∣∣∣∣Xn̂p= Yn̂q

). (3)

Again, this is a slight abuse of notation since the binary op-eration is defined on the matrices QX ,QY but is expressedin terms of the symbolic sequences X,Y .

Lemma 1. Let X ∈ Pp and Y ∈ Pq. Let Z = X⊕Y . ThenZ ∈ Pr, where r is the lowest common multiple of p and q.

Proof: Let m = n + rs where r is the lowest commonmultiple of p and q and s is any positive integer. Then m̂p =

n̂p and m̂q = n̂q. Thus for all a ∈ A, P(Zm = a

)=

P(Xn̂p

= a, Yn̂q= a

∣∣∣Xn̂p= Yn̂q

)= P

(Zn = a

).

Corollary 1. Let X,Y ∈ Pp. Then X ⊕ Y is p-statisticallyperiodic.

In Lemma 1, if p and q are mutually prime then Z ∈ Ppq.If QX ,QY and QZ denote the stochastic matrices of X,Yand Z, respectively, then by definition (3), the nth columnof the M × pq matrix QZ is

qZn =

1

C

⎡⎢⎣

qXn̂p

(1)qYn̂q

(1)...

qXn̂p

(M)qYn̂q

(M)

⎤⎥⎦ (4)

where C =∑M

j=1 qXn̂p

(j)qYn̂q

(j) is the normalization factor.

Example 1. Consider an example of composition of twocyclostationary sources with statistical periods 2 and 3. Eqn.(4) gives⎡⎢⎢⎣

.25 .6

.25 .2

.25 .1

.25 .1

⎤⎥⎥⎦

︸︷︷︸X∈P2

⊕

⎡⎢⎢⎣

.3 .1 10 .1 0.3 .2 0.4 .6 0

⎤⎥⎥⎦

︸︷︷︸Y ∈P3

=

⎡⎢⎢⎣

0.3 0.375 1 0.72 0.1 10 0.125 0 0 0.1 0

0.3 0.125 0 0.12 0.2 00.4 0.375 0 0.16 0.6 0

⎤⎥⎥⎦

︸︷︷︸Z∈P6

197

Note that the first source in the sequence X is the identity ofthe composition and the last source of the sequence Y actslike an infinity of the operation.

If X = Y , then Z = X ⊕ Y is in Pp with

qZn (k) = (qX

n (k))2/

M∑j=1

(qXn (k))2,

for k = 1, . . . , M and n = 1, . . . , p. The operation of com-posing a symbolic sequence with itself can also be expressedas multiplication by the scalar 2; write Z = X⊕X = 2◦X .This definition can be extended to multiplication by anyscalar. For r ∈ R and X ∈ P define

◦ : R× P → P(r,X) �→ Z

(5)

so that Z = r ◦X is the random symbolic sequence with

P(Zn = a

)=

P (Xn = a)r∑b∈A P (Xn = b)r

(6)

for all a ∈ A with P (Xn = a) �= 0. When P (Xn = a) = 0,P (Zn = a) is defined to be 0. If X ∈ Pp, Z ∈ Pp.

Example 2. Consider an example of scalar multiplication.Let X be a cyclostationary symbolic sequence with Xi

distributed as qXi = [12

14

14 0]T . If Y = 2 ◦ X then Yi

is distributed as qYi = [23

16

16 0]T .

Theorem 1. The set P forms an abelian group under thebinary operation ⊕ : P × P → P .

Proof: The closure of P under ⊕ follows by Lemma 1and the operation is commutative by definition. Associativityis easy to check: let X,Y, Z ∈ P have statistical periodic-ities p, q and r respectively. Let V = X ⊕ (Y ⊕ Z) andW = (X ⊕ Y )⊕ Z. Then QV

ji can be rewritten as

QXjip

(QY

jiqQZ

jir

)∑

j QXjip

(QY

jiqQZ

jir

) =

(QX

jipQY

jiq

)QZ

jir∑j

(QX

jipQY

jiq

)QZ

jir

= QWji

for j = 1, . . . , M and i = 1, . . . , pq. The unique identityelement, denoted E, is the stationary or 1-statistically peri-odic random sequence such that P (E = aj) = 1

Mfor all

aj ∈ A. Finally, for X ∈ P if Y = (−1) ◦ X then it iseasy to verify that X ⊕ Y = E. Thus every X ∈ P has aninverse.

Theorem 2. (P,⊕, ◦) is a vector space over R.Proof: The closure of P under ◦ follows by definition

and the identity element is 1 ∈ R since 1 ◦ X = X . Thedistributive properties are easy to check: for α ∈ R, X ∈ Pp

and Y ∈ Pq , α◦(X⊕Y ) = (α◦X)⊕(α◦Y ) and for α, β ∈ R

and X ∈ Pp, (α + β) ◦ X = (α ◦ X) ⊕ (β ◦ X). Finally,scalar multiplication is compatible with multiplication in thefield of scalars: α ◦ (β ◦X) = (αβ) ◦X.

Corollary 2. For p ∈ Z+, Pp is a subspace of P .

IV. DECOMPOSING PERIODICITIES

A fundamental problem in symbolic signal processing[8] is identifying the periodic structure of the symbolicsources. Given a realization of the symbolic sequence, themaximum likelihood estimates of the statistical periodicityand the corresponding pmf were derived in [13]. This sectioninvestigates the problem of decomposing the discoveredsymbolic source into various smaller components.Assume that an observed sequence Z ∈ Ppq was orig-

inally composed of sequences X ∈ Pp and Y ∈ Pq, i.e.Z = X⊕Y . Then Zn = Xn̂p

⊕Yn̂q, for n = 1, . . . , pq. The

pq equations can be expressed in matrix form as

⎡⎢⎣

Z1

...Zpq

⎤⎥⎦

pq×1

=

⎡⎢⎣

Ip Iq

......

Ip Iq

⎤⎥⎦

︸︷︷︸Tpq×(p+q)

◦

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎣

X1

...Xp

Y1

...Yq

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎦

(p+q)×1

. (7)

Theorem 3. For mutually prime p and q, the matrix T abovehas rank p + q − 1. The null space of T is spanned by thevector [−1 . . .− 1︸︷︷︸

p

1 . . . 1︸︷︷︸q

]

Theorem 3 shows that if Z ∈ Ppq can be decomposedinto Z = X ⊕Y for some X ∈ Pp and Y ∈ Pq, then it canalso be decomposed as

(X ⊕ δp)⊕ (Y δq) = Z

where Y δq = Y ⊕ (−1 ◦ δq) and δr = [

r︷︸︸︷δ, . . . , δ] for some

δ ∈ P1 and r = p, q. Thus there is a class of decompositionsof Z. In words, a pq-periodic symbolic source Z can bedecomposed into p and q−periodic components X,Y uniqueup to an additive factor δ ∈ P1.

Example 3. With the same X and Y as in example 1,⎡⎢⎢⎣

2/10 12/233/10 6/233/10 3/232/10 2/23

⎤⎥⎥⎦

︸︷︷︸X′=X⊕δ

⊕

⎡⎢⎢⎣

1/3 1/9 10 2/27 0

2/9 4/27 04/9 2/3 0

⎤⎥⎥⎦

︸︷︷︸Y ′=Y�δ=Y⊕(−1◦δ)

=

⎡⎢⎢⎣

0.3 0.375 1 0.72 0.1 10 0.125 0 0 0.1 0

0.3 0.125 0 0.12 0.2 00.4 0.375 0 0.16 0.6 0

⎤⎥⎥⎦

︸︷︷︸Z

where δ =[

210

310

310

210

]T and −1 ◦ δ =[

310

210

210

310

]T

198

V. DISCUSSION

The investigation of multiple periodicities in symbolicsequences and their decomposition into smaller periodiccomponents may be useful as a way to understand theunderlying generative mechanism. For instance, in DNAsequences, the decomposition may provide insight into theunderlying evolutionary process that determines the structureof the sequences. The investigation is challenging at leastin part due to the lack of an algebraic structure. Theapproach used here models the symbolic sequence as anonstationary random process on a finite alphabet and thenstudies the (de)composition of the distributions. In particular,the decomposition of DNA sequences are studied under acomposition rule that is inspired by the biological model forgene replication and mutation.The formulation of the problem in this paper is different

from the classical stochastic techniques where distributionsare estimated by averaging over various ensembles or re-alizations. Often, it is impractical or impossible to obtainmore than one realization and an engineer’s solution isto perform averaging over single realization of data. Thistemporal averaging may be justified when the data exhibitscyclostationarity over long periods or when it is reasonableto assume ergodicity. An interesting discussion about the twoapproaches may be found in [20].

VI. REFERENCES

[1] Wei Wang and Don H Johnson, “Computing lineartransforms of symbolic signals,” IEEE Trans. On SignalProcessing, vol. 50, no. 3, pp. 628–634, March 2002.

[2] E V Korotkov and N Kudryaschov, “Latent periodicityof many genes,” Genome Informatics, vol. 12, pp. 437– 439, 2001.

[3] The Huntington’s Disease Collaborative ResearchGroup, “A novel gene containing a trinucleotide repeatthat is expanded and unstable on huntington’s diseasechromosomes,” Cell, vol. 72, pp. 971–983, March1993.

[4] Andrzej K. Brodzik, “Quaternionic periodicity trans-form: an algebraic solution to the tandem repeat de-tection problem,” Bioinformatics, vol. 23, no. 6, pp.694–700, Jan 2007.

[5] M B Chaley, E V Korotkov, and K G Skryabin,“Method revealing latent periodicity of the nucleotidesequences modified for a case of small samples,” DNAResearch, vol. 6, pp. 357 – 363, Feb. 1999.

[6] E V Korotkov and D A Phoenix, “Latent periodicityof DNA sequences of many genes,” in Proceedingsof Pacific Symposium on Biocomputing 97, 1997, pp.222–229.

[7] E V Korotkov and M A Korotova, “Latent period-icity of dna sequences of some human genes,” DNASequence, vol. 5, pp. 353, 1995.

[8] Dimitris Anastassiou, “Genomic signal processing,”IEEE Signal Processing, vol. 18, pp. 8–20, Jul 2001.

[9] V R Chechetkin, L A Knizhnikova, and A Yu Turygin,“Three-quasiperiodicity, mutual correlatuions, orderingand long modulations in genomic nucleotide sequencesviruses,” Journal of biomolecular structure and dynam-ics, vol. 12, pp. 271, 1994.

[10] E A Cheever, D B Searls, W Karunaratne, and G COverton, “Using signal processing techniques for dnasequence comparison,” in Proc. of the 1989 FifteenthAnnual Northeast Bioengineering Conference, Boston,MA, Mar 1989, pp. 173 – 174.

[11] Ravi Gupta, Divya Sarthi, Ankush Mittal, and KuldipSingh, “Exactly periodic subspace decompositionbased approach for identifying tandem repeats in dnasequences,” in Proc. of the 14th European SignalProcessing Conference (EUSIPCO), Florence, Italy,Sep 2006.

[12] Marc Buchner and Suparerk Janjarasjitt, “Detectionand visualization of tandem repeats in dna sequences,”IEEE Transactions on Signal Processing, vol. 51, no.9, pp. 2280–2287, Sep 2003.

[13] Raman Arora and William A. Sethares, “Detection ofperiodicities in gene sequences: a maximum likelihoodapproach,” in GENSIPS (Genomic Signal Processingand Statistics), Tuusula, Finland, June 2007.

[14] W. A. Gardner, A. Napolitano, and L. Paura, “Cy-clostationarity: half a century of research,” SignalProcessing, vol. 86, pp. 639–697, 2006.

[15] P. D. Cristea, “Genetic signal representation and analy-sis,” in Proceeding SPIE Conference, InternationalBiomedical Optics Symposium (BIOS02), 2002, pp. 77–84.

[16] Gail L. Rosen, Signal Processing for biologically-inspired gradient source localization and DNA se-quence anlysis, PhD Thesis, Georgia Institute ofTechnology, 2006.

[17] Mahmood Akhtar, Julien Epps, and Eliathamby Am-bikairajah1, “On dna numerical representations forperiod-3 based exon prediction,” in GENSIPS (Ge-nomic Signal Processing and Statistics), Tuusula, Fin-land, June 2007.

[18] DNA, [Online]. Available on Wikipedia athttp://en.wikipedia.org/wiki/DNA.

[19] Roy H Burdon, Genes and the Environment, Taylorand Francis Inc., PA, 1999.

[20] William A. Gardner, Cyclostationarity in Communica-tions and Signal Processing, IEEE press, NY, 1994.

199

decomposing statistical periodicitiesraman/publications_files/arora.ssp.07.pdf · locating hidden...

Documents