molecular evolution, cont. comparing estimation methods. application to human and mouse sequences

1

Molecular evolution, cont.Comparing estimation methods.

Application to human and mouse sequences

Lecture 16, Statistics 246

March 16, 2004

2

The resolvent method

Müller and Vingron (2000) proposed a fast estimation method for sequence pairs based on resolvents. It can in fact be applied to multiple aligments generated by a reversible Markov process. For > 0, the resolvent R of a rate matrix Q is given by

R = (I - Q)-1.

Solving for Q gives the following formula:

Q = I - R-1

.

It turns out that the resolvent is the Laplace transform of the transition matrices,

€

Rα = e−αtP(t)dt0

∞

∫ .

3

Resolvent method, cont.

This is the key formula. (Prove it.) Given many pairs of sequences, not necessarily disjoint, that are separated by t PAMs, an unbiased estimate of P(t) can be obtained by normalizing the symmetrized sum of frequency tables. If P(t) can be estimated for a wide range of t, then we can get an estimate of Q via the last two equations. That is the idea.

In practice, there are two issues: (i) the distances are unknown, and so must be estimated by ML, and (ii), the estimated distances are discrete, and so interpolation must be used to estimate the rate matrix.

Let the estimated distances be 0 < t1 <….< tN . Then the integral is approximately equal to the sum of N pieces:

4

Resolvent method, cont.

which can be evaluated exactly, after replacing the Ps by their estimates. Summing these integrals gives an estimate of R, and by inversion, of Q.

€

Rα ≈ ( +.....tN −1

tN∫ )e−αtP(t)dt.0

t1∫The kth int egral is approximated by linear int erpolation

e−αt (P(tk−1) +t − tk−1

tk − tk−1tk−1

tk∫ [P(tk ) − P(tk−1)]),

5

REVIEW: BLOck SUbstitution Matrices (BLOSUM)

Henikoff and Henikoff (1992) used an ad hoc method that takes time inhomogeneity into account to construct the BLOSUM (block substitution matrix) matrices. The input is a set of blocks, which are gap-free multiple alignments of segments of homologous amino acid sequences. A frequency table is derived from the blocks by summing over the match and mismatch patterns from all within-block pair-wise comparisons. Since a mismatch such as an A aligned to an S, can be written in two ways, AS and SA, we get rid

of the ambiguity by using only AS. In general, a mismatch is represented by sequences XY, where X precedes Y alphabetically. For example, suppose

that in a block with six sequences, two columns are as follows: ..AD.. ..AD.. ..AE.. ..AE.. ..AD.. ..SD..

6

REVIEW: BLOSUM matrices, cont.

There are a total of 15 pairwise comparisons. The left column contributes 10 AA and 5 AS pairs to the frequency table. Similarly, the right column contributes 6 DD, 1 EE and 8 DE pairs. Adding these column contributions within the block, and then across all blocks, gives a triangular frequency table. The matrix is symmetrized by adding itself to its transpose. Dividing the matrix by its sum yields a symmetric joint distribution, and a substitution matrix is obtained as described for the PAM matrices.

To downweight the contribution of the more closely related sequences to the frequency table, the sequences within each block are clustered. Let be a fixed number between 0 and 100. Sequences that are more than % similar are “greedily” clustered. In other words, any two sequences that are more than % similar are put in the same cluster, and if each sequence already belongs to some cluster, then the two clusters are combined to form a larger cluster. In the end, the sequences within a block are partitioned into disjoint clusters, so that any two sequences from distinct clusters are less than % similar. It is clear that the clusters are well-defined, i.e., independent of the initial choice of sequences.

7

REVIEW: BLOSUM matrices, cont.

Sequences in the same cluster are downweighted by the cluster size in cross-cluster pairwise comparisons, and pairwise comparisons of sequence in the same cluster do not contribute to the frequency table. In the example, suppose that the first four sequences are clustered while the last two sequences are not. Then the contribution of the left column is the same as an A-A-S column: 1 AA, 2 AS pairs. The right column is effectively (D/E)-D-D, where D/E represents half a D and half an E. Its contribution is 2 DD (1 + 1/2 + 1/2) and 1 DE (1/2 + 1/2) pairs. Equivalently, sequences in the same cluster are replaced by an “average” sequence with fractional number of residues at each position. Then the frequency table is derived as if the average sequences are real sequences; blocks that have only one cluster are left out. Let the symmetric joint distribution be denoted by f . Let be the row of column sum of f . Then the substitution matrix BLOSUM is given by

S(, a, b) = 2log2{f(a,b)/(a)(b)}

8

REVIEW: BLOSUM, final remarks.

If is 100, then every cluster is of size 1, so f is an average of the substitution patterns over all distances. If is a small value, such as 20, then there are a small number of large clusters, and the similarity between the sequences from distinct clusters ranges from 0 to 20%, so f depends only on the substitution patterns of the distantly related sequences.

Thus reducing the threshold gives substitution matrices which are more suitable for aligning distantly related sequences. In this context, BLOSUM40 is to be preferred to BLOSUM80 when aligning distantly related protein sequences.

9

An inhomogeneous process

In order to compare methods of estimating rate matrices, we are going to define a simple class of inhomogeneous processes, i.e. processes for which no single rate matrix Q describes the evolution. Let Q1 be a calibrated reversible rate matrix, with all off-diagonal entries strictly positive, and let be its stationary distribution. Define a reversible matrix Q0 as follows:

Q0(a,b) = (b), if a b; Q0(a,a) = - ba(b), if a=b.

Clearly Q0 is the simplest reversible rate matrix with stationary distribution . Calibrate Q0 by scaling, and note that Q0 commutes with Q1. Consider the process Q defined by

Q(x) = Q0 if 0 ≤ x ≤ 100; Q(x) = Q1 if 100 ≤ x ≤ 200. Note that Q has stationary distribution , is reversible and calibrated. Indeed the same

would be true of the more general form

Q(x) = Q0 + g(x)[Q1 - Q0], x≥ 0, for any integrable g : [0, ) [0,1]. Using some (product integral) theory we omit, we can get well defined

inhomogeneous Markov processes having the Q(x) as instantaneous rate matrices at time x≥ 0. We use one further notion: a joint distribution f on amino acids that is embeddable in a continuous time Markov process can be written f = *exp(Q*t*) for a calibrated Q*, stationary distribution * and effective divergence time t*.

10

Simulation comparison of methods

We will compare the BLOSUM, resolvent and ML methods in the following way. Let {30, 35, …, 85}. Given sequence blocks generated by a substitution process, run BLOSUM at threshold (i.e. using sequences more than % similar) to get f with effective divergence time t*. Apply the resolvent and ML methods to get estimated rate matrices leading to fres and fmle. Define the deviations

dblo = d(f , f(t*))

dres = d(fres(t*), f(t*))

dmle = d(fmle(t*), f(t*)),

where d(, ) = a |(a) - (a)|. Suppose that the substitution model is fixed, i.e. the substitution process, the trees and the sequence lengths are all fixed. Then the average of the deviations over all realizations is the sequences is a measure of the relative performance of the methods.

11

Simulation comparison, cont.

Our simulations are based on a pair of rate matrices Q0 and Q1. Here Q1 is the rate matrix estimated by ML from 13,255 pairwise amino acid alignments of length at least 100, which were obtained from the SYSTERS database courtesy of Tobias Müller. Q0 is constructed from Q1 as indicated earlier. In every simulation 30 sequence pairs separated by the distances 10, 20, …, 300 PAMs are generated according to a substitution model. In other words, there are 30 blocks, and each block has 2 sequences. Also, all trees have depth 150 PAMs, so that a sequence pair separated by 100 PAMs have a most recent common ancestor 50 PAMs ago, and the most recent common ancestor of a sequence pair separated by 300 PAMs sits at the top of the tree. The simulations differ through the substitution models, coming next, and all sequences have length 5,000.

1) A reversible homogeneous process generated by Q1.

2) An inhomogeneous process with Q(t) = Q0 + (Q1 - Q0)t/150, 0 ≤ t ≤150.

3) An inhomogeneous process with

Q(t) = Q0 , 0 ≤ t ≤100; Q(t) = Q1 , 100 ≤ t ≤150.

12

Reversible homogeneous process

ML better than RES better than BLOSUM, note pattern

13

Smooth inhomogeneous process

ML still best, now BLOSUM next, and RES worst.

14

Discontinuous inhomogeneous process

At low thresholds, BLOSUM beats ML handily (as it should!)

15

Summary of comparison

For an inhomogeneous process, the BLOSUM method is rather more stable and less biased than either the resolvent or ML methods, at low thresholds, but not so good at high thresholds. BLOSUM is only slightly worse than ML and resolvent at high thresholds.

The simulations suggest that a) The BLOSUM method is a robust method for deriving

substitution matrices at large evolutionary distances; whileb) The ML and resolvent methods are better than the BLOSUM

(and ML appreciably better than resolvent) for deriving substitution matrices at small distances.

These conclusions have been reached by simulation, for pair data, but are probably true more widely.

16

Modelling genomic DNA base substitution

In the final part of this lecture, we look at estimates of rate matrices connecting human chromosome 22 sequence with the homologous mouse sequence. More precisely, a method previously described is applied to a large collection of local alignments, so-called high scoring pairs, or HSPs. We focus on intergenic regions, and gap-free HSPs, for simplicity. Our assumption is that in a gap-free alignment, every pair of aligned bases has evolved (descended) from a single ancestral base.

We examine the estimated rate matrices for strand-symmetry, and homogeneity across varying base composition. We find approximate strand-symmetry, but considerable heterogeneity. Grouping by GC content helps, but we still see evidence of non-stationarity.

17

Data preprocessing and summary

The essentially complete sequence of human chromosome 22 was aligned to the mouse sequence using blastz in Pipmaker, giving 47,166 HSPs. Those that intersected genes were removed, leaving 26,593 HSPs totally in the intergenic region. Those closer than 10 bp were merged, and the 10,638 of length greater than 100 bp were used.

Given an HSP, the count matrix is the 4x4 matrix that has the numbers of each of the 16 possible pairs of aligned bases in the HSP. The %identity is the sum of the diagonal counts divided by the total sum = the HSP length.

The percent non-identity is a monotonely increasing function of evolutionary distance, and approximately linear for small distances.

The two marginal frequencies of the count matrix are the base compositions of the human and mouse sequences, respectively. The base composition of an HSP is the average of these two, while the GC content is the fraction of baases that are either G or C. (Protein-coding genes tend to be located in the regions with higher GC-content.)

We now give plots of some of these quantities.

18

Plot of percent identity against position along chr 22 for each HSP

19

Plot of percent identity against position for the first 1,000 HSPs

Gaps are due to repetetive elements.

20

Plot of percent identity against position for the 5,001st to 6,000th HSP

21

Plot of human GC-content against position along chromosome 22

Pretty variable!

22

Scatterplot of human and mouse GC-content for each HSP

23

Comparing base compositions

If we denote the human and mouse base compositions for an HSP by h and m , then the difference in base compositions is

D = a |h -m |.

If the sequences are generated by a stationary model, then D 0 as the length n of the HSP increases, at approximately rate n-1/2.

The standardized quantity n-1/2D is a measure of composition imbalance between the two sequences making up an HSP.

24

Histogram of imbalance between human and mouse base compositions

The mean and SD of the histogram of imbalances simulated under a fitted model are 1.3 and 0.6 Some evidence of non-stationarity in substitution.

25

Plot of composition imbalance against position along chromosome 22

26

Estimating reversible calibrated rate matrices

We turn now to estimating rate matrices by ML. In principle, each position in each HSP could evolve at a different rate, but with just two species, we have no hope of discovering this. What we do is permit each HSP to have its own evolutionary distance in PAMs separating the human and mouse components. These are estimated by ML, along with the reversible calibrated rate matrices, initially a common one, and later we stratify by GC-content. The methods have been given in the last lecture.

27

First results We find that 10,000 times the MLE Q^ is A C G T A -99 19 62 18 C 20 -100 16 64 G 65 16 -101 20 T 18 63 19 -100 The transition rates (AG, CT, bold) are around 60, which is about 3 times the

transversion rates. The stationary distribution of Q^ is almost uniform: A G C T .26 .25 .24 .25 while the symmetric part, R^ is A C G T A -386 79 253 71 C 79 -407 65 254 G 253 65 -413 79 T 71 254 79 -396 The transition rates here are almost identical, but the transversion rates are more

variable. Clearly Q^ is nearly strand symmetric, i.e., the rate for AC is essentially the same as for TG (the complements), and so on.

28

Plot of percent identity against estimated distance in PAMs.

Curvature corresponds to repeated substitutions. Increasing spread probably corresponds to larger variance and/or lack of fit.

29

Goodness-of-fit

A diagnostic tool for evaulating the goodness of fit of the model to all the 4x4 tables is the Pearson chi-squared statistic. Suppose that ni and ti are the length and estimated distance for the ith HSP. Then the expected count matrix is

Define the statistic

€

χ i2 =

[ν (i,a,b) − e(i,a,b)]2

e(i,a,b)a,b

∑ .

€

e(i)=e(i,) Q ,

) t i)=ni×

) F (

) t i),where

) F (t)=

) Π exp(

) Q t).

30

Goodness-of-fit, cont.

Under the model, if the number N of HSPs is large, and they are all long, then the N χ2 statistics are approximately independent from a χ2

14 distribution. To check this we simulate 1,000 HSPs of the same size as the first 1,000 in the data, using the MLEs for Q and the corresponding ti . The model parameters are then estimated and the χ2 statistics computed. Their mean and SD are about 14 and 5.3 = √(2x14), see the histogram on the next slide.

The mean and the SD of the 10,638 observed χ2 statistics turns out to be 53 and 54 respectively, indicating that the fit is very bad. This is partly due to compositional heterogeneity. Removing non-stationary HSPs helps, e,g, using only those HSPs with √nD < 2 leaves us with a mean 40 and SD of 33, a modest improvement.

31

Histogrtam of 1,000 χ 2 statistics on simulated HSPs.

Mean 14, SD √28 as expected.

32

GC-specific rate matrices

GC 10,000 x Q

Low -85 16 52 17

81

25

-87

25 -121 15

-123

16 83 15

18 53

Medium -99 20 62 18

20 -100 15 65

65 15 -101 20

18 62 19 -99

High -116 23 75 18

17 -88 16 54

55 16 -88 17

18 75 24 -117

33

GC-specific rate matrices: symmetric parts and stationary distributions

Group 10,000 x R

Low -277 82 268 57

82 -609 77 268

268 77 -637 84

57 268 84 -289

.30

.20

.20

.30

Medium -387 80 255 69

80 -409 62 254

255 62 -415 79

69 254 79 -300

.26

.25

.24

.25

High -548 81 257 87

81 -304 57 260

257 57 -303 81

87 260 81 -562

.21

.29

.29

.21

Transitions in bold type: note similarity across the tables..

34

Goodness-of-fit revisited

Group Mean χ2 SD of χ2

Overall 53 54

Low GC 34 26

Medium GC 32 29

High GC 40 38

Here are the summaries of the χ2 statistics. Although far from good, they are much better. We conclude that heterogeneity contributes to the lack of fit of the model.

35

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

36

Goodness-of-fit, almost concluded.

Dividing our HSPs into three groups based on GC-content helps, but still leaves considerable heterogeneity in each group. An important issue is base composition, related to but more complex than GC-content.

We now consider smaller, more homogeneous groups of HSPs, and check the adequacy of model within them.

Consider those with composition close to = (.26, .25, .24, .25). If an HSP has length n and marginal frequencies h and m, its standardized distance from is given by

√n/2 x a|h(a) + m(a) - 2 (a)|.

A rough cut-off can be obtained by simulation based on the overall rate matrix and associated distances, and we find the mean and SD of simulated standardized distances are 1.2 and 0.5, respectively. Fitting the model to the 1,305 HSPs with standardized distance <1.7 gives and estimated rate matrix quite close to the overall rate matrix. Further, the mean and SD of the χ2 statistics are 18 and 10 respectively. The qq-plot is the next page but one.

37

χ214

qq-plot of HSPs with composition close to (.2,.3,.2,.3)

38

Goodness-of-fit, concluded.

We see some HSPs not well fitted by our model. These tend to have high imbalance. Removing those HSPs with imbalance exceeding 2 leaves 745 HSPs. Fitting the model to them gives the same rate matrix, but now the mean and SD of the χ2

are 14,5 and 5.1 respectively. This suggests that the lack of fit is mainly due to non-stationarity, and that stationary HSPs seem to have evolved according to a reversible process, which is also nearly strand-symmetric.

Applying the same procedure for the compositions (.2,.3,.3,.2) and (.3,.2,.2,.3) and (.2,.3,.2,.3) leads to similar observations: up to 40% of the HSPs in each group are non-stationary, and removing them makes the fit very good. For the strand-symmetric compositions (all exept (.2,.3,.2,.3)), the estimated rate matrices are all nearly strand-symmetric, indicating that the evolution of those stationary HSPs is probably strand-symmetric.

molecular evolution, cont. comparing estimation methods. application to human and mouse sequences

Documents

pairs of sequences

sequences xy

related sequences

resolvent r

estimate of q

rate matrix q

blosum matrices

estimate of r