Post on 04-Jan-2016


Inadequacies of Minimum Spanning Trees in Molecular Epidemiology

Stephen J. Salipante and Barry G. Hall

JOURNAL OF CLINICAL MICROBIOLOGY, Oct. 2011, p. 3568–3575

Speaker: Bin-Shenq Ho

Dec. 19, 2011

Underlying Reasoning

• How representative is a single, arbitrarily selected MST of the potentially many equally optimal solutions?

• What role can statistical metrics play in assessing the credibility of MST estimates?

Materials and Methods

MSTgold

http://www.bellinghamresearchinstitute.com

http://web.me.com/barryghall/

Max amount of time

Max number of unique MSTs

Min rate of new discovery

Materials and Methods

Distance matrix calculation

• Equidistant method: sequence, spoligotype, SNP

• Difference method: VNTR

spoligotype: spacer oligonucleotide type

(http://www.cdc.gov/tb/programs/genotyping/Chap3/3_CDCLab_2Description.htm)

VNTR: variable number of tandem repeat

(http://www.cdc.gov/tb/programs/genotyping/Chap3/3_CDCLab_2Description.htm)

MLST: multilocus sequence type

• The procedure characterizes isolates of bacterial species using the DNA sequences of internal fragments of multiple housekeeping genes.

• For each housekeeping gene, the different sequences present within a bacterial species are assigned as distinct alleles and, for each isolate, the alleles at each of the loci define the allelic profile or sequence type (ST).

• Nucleotide differences between strains can be checked at a variable number of genes depending on the degree of discrimination desired.

(http://en.wikipedia.org/wiki/Multilocus_sequence_typing)

Materials and Methods

MSTs estimation and MSNs creation

• Kruskal’s algorithm, run repeatedly with randomized node input order

• The combination of all edges defined within the unique MSTs constitutes the MSN.
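The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function names `kruskal_mst` and `sample_msts` are hypothetical, and the way ties are broken (a stable sort over edges enumerated in the shuffled node order) is an assumption about how node order influences the result.

```python
import random
from itertools import combinations

def kruskal_mst(dist, order):
    """Return one MST as a frozenset of edges; equal-length edges keep the given node order."""
    parent = {v: v for v in order}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # sorted() is stable, so tied edges stay in the order produced by `order`.
    edges = sorted(combinations(order, 2), key=lambda e: dist[e[0]][e[1]])
    tree = set()
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so keep it
            parent[ru] = rv
            tree.add(frozenset((u, v)))
    return frozenset(tree)

def sample_msts(dist, cycles=200, seed=0):
    """Collect the unique MSTs found over `cycles` random node orders, plus their edge union (MSN)."""
    rng = random.Random(seed)
    nodes = list(range(len(dist)))
    unique = set()
    for _ in range(cycles):
        rng.shuffle(nodes)
        unique.add(kruskal_mst(dist, nodes))
    msn = set().union(*unique)  # every edge that occurs in at least one sampled MST
    return unique, msn

# Three isolates that are all equally distant from one another.
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
unique, msn = sample_msts(triangle)
print(len(unique), "unique MSTs;", len(msn), "edges in the MSN")
```

Note that this simple tie-breaking rule does not necessarily reach every equally optimal tree, which is one reason the total number of MSTs has to be estimated rather than counted.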

Materials and Methods

Estimating the number of possible MSTs through mark-recapture (Schnabel method)

N = [(M + 1)(C + 1)] ÷ (R + 1) - 1

N + 1 = [(M + 1)(C + 1)] ÷ (R + 1)

(M + 1) ÷ (N + 1) = (R + 1) ÷ (C + 1)

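M : Mark (number of MSTs already marked in earlier samples)

C : Current (number of MSTs in the current sample)

R : Recapture (number of MSTs in the current sample that were already marked)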
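A short worked example of the estimator above may help. The function name `estimate_total_msts` and the example counts are illustrative assumptions, not values from the paper.

```python
def estimate_total_msts(marked, current, recaptured):
    """N = [(M + 1)(C + 1)] / (R + 1) - 1, as given on the slide."""
    return (marked + 1) * (current + 1) / (recaptured + 1) - 1

# Hypothetical counts: 50 MSTs already marked, 20 MSTs in the current sample,
# 9 of which had been seen before.
n_hat = estimate_total_msts(marked=50, current=20, recaptured=9)
print(round(n_hat, 1))  # 51 * 21 / 10 - 1 = 106.1
```

When every MST in the current sample is a recapture (for example M = C = R = 10), the estimate equals the number already found, which is the behaviour one would expect once the search has saturated.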

Materials and Methods

Bootstrapping

• Used to establish the confidence level of a model

• 100 individual pseudoreplicates for each MST

• Bootstrap value expressed as the fraction of pseudoreplicates yielding the same inference as the original data

• Given enough information, there should be sufficiently redundant data that independent pseudoreplicates will yield analyses identical to that of the complete data set.

Bootstrap

Efron and Gong (1983)

Diaconis and Efron (1983)

Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791

Inferring the variability in an unknown distribution from which your data were drawn by resampling from the data
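The resampling idea can be sketched for MST edges as follows. This is only an illustration under stated assumptions: sites (columns) are resampled with replacement, distances are recomputed with the equidistant method, and an edge counts as recovered when it belongs to at least one MST of the pseudoreplicate. That last scoring rule, and the names `hamming_matrix`, `in_some_mst`, and `bootstrap_support`, are assumptions and may differ from the procedure used in the paper.

```python
import random

def hamming_matrix(profiles, columns):
    """Equidistant distances computed over the chosen (resampled) columns."""
    n = len(profiles)
    return [[sum(profiles[i][c] != profiles[j][c] for c in columns)
             for j in range(n)] for i in range(n)]

def in_some_mst(dist, u, v):
    """True if edge (u, v) lies in at least one MST of the complete graph.

    An edge is in some MST exactly when its endpoints cannot be connected
    using only strictly shorter edges.
    """
    w = dist[u][v]
    seen, stack = {u}, [u]
    while stack:
        a = stack.pop()
        for b in range(len(dist)):
            if b not in seen and dist[a][b] < w:
                seen.add(b)
                stack.append(b)
    return v not in seen

def bootstrap_support(profiles, tree_edges, replicates=100, seed=0):
    """Fraction of pseudoreplicates in which each edge of the tree is recovered."""
    rng = random.Random(seed)
    n_sites = len(profiles[0])
    hits = {edge: 0 for edge in tree_edges}
    for _ in range(replicates):
        # Resample sites with replacement to build one pseudoreplicate.
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        dist = hamming_matrix(profiles, columns)
        for u, v in tree_edges:
            if in_some_mst(dist, u, v):
                hits[(u, v)] += 1
    return {edge: count / replicates for edge, count in hits.items()}

# Hypothetical allelic profiles: isolate 1 sits between isolates 0 and 2.
profiles = [[0, 0, 0, 0, 0, 0],
            [1, 1, 1, 0, 0, 0],
            [1, 1, 1, 1, 1, 1]]
print(bootstrap_support(profiles, [(0, 1), (1, 2), (0, 2)]))
```

Averaging the per-edge values over a tree gives one possible "average bootstrap value" for comparing alternative MSTs.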

Results

Estimating alternative MSTs

• Multiple, equally parsimonious solutions are possible

• Kruskal’s MST algorithm is sensitive to node input order

• The Schnabel method is appropriate for estimating the number of alternative MSTs, especially after discarding the early cycles of node order randomization

The number of possible MSTs is proportional only to the number of minimal pairwise distances with equal lengths.

There is a relationship between the number of possible MSTs and the method used to compute the pairwise distance matrix.

note for distance matrix computation

• Equidistant method – sites scored merely as “same” or “different” such that any difference carries the same weight

• Difference method – distances between sites calculated on the basis of the difference between the values of the two sites
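The contrast between the two methods can be illustrated with a small sketch. The repeat counts below are made up, and summing absolute per-site differences is an assumption about how the difference method combines sites; the function names are hypothetical.

```python
def equidistant(a, b):
    """Count the sites that differ; any difference carries the same weight."""
    return sum(x != y for x, y in zip(a, b))

def difference(a, b):
    """Sum the absolute differences between site values (assumed form of the difference method)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_matrix(profiles, metric):
    """Pairwise distance matrix for a list of profiles under the given metric."""
    return [[metric(p, q) for q in profiles] for p in profiles]

# Hypothetical VNTR repeat counts for four isolates at three loci.
vntr = [[2, 5, 3],
        [3, 5, 3],
        [7, 5, 3],
        [2, 6, 3]]

for row in distance_matrix(vntr, equidistant):
    print(row)
print()
for row in distance_matrix(vntr, difference):
    print(row)
```

In this toy data set the equidistant matrix contains only the off-diagonal values 1 and 2, so many pairwise distances tie, whereas the difference matrix spreads them over the values 1, 2, 4, 5, and 6.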

There were significantly fewer alternative MSTs possible when the same data were processed using the difference method.

There is a relationship between the type of data used and the number of possible alternative MSTs.

Results

Estimating alternative MSTs

• When there are limited numbers of informative sites and alleles are treated as equidistant from one another, there are many pairwise distances of the same length, and large numbers of MSTs are possible.

• Basing analyses on the arithmetic number of pairwise differences among individuals both limits the number of possible MSTs and more faithfully represents the genetic distances between individuals.

Results

Creating MSN

• Approximation by majority rule

dashed line – edges present in ≥ 50% of MSTs

solid line – edges present in 100% of MSTs

• Fraction ≠ Credibility
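The majority-rule drawing convention can be sketched as below. The names `edge_fractions` and `line_style` are hypothetical, and treating edges found in fewer than 50% of MSTs as "omitted" is an assumption, since the slide only defines the dashed and solid cases.

```python
from collections import Counter

def edge_fractions(msts):
    """Fraction of the sampled MSTs that contain each edge."""
    counts = Counter(edge for tree in msts for edge in tree)
    return {edge: n / len(msts) for edge, n in counts.items()}

def line_style(fraction):
    """Map an edge fraction to the drawing convention described above."""
    if fraction == 1.0:
        return "solid"    # present in 100% of MSTs
    if fraction >= 0.5:
        return "dashed"   # present in at least 50% of MSTs
    return "omitted"      # assumption: below the majority-rule threshold

# Three hypothetical MSTs over isolates 0, 1, and 2.
msts = [{(0, 1), (1, 2)},
        {(0, 1), (0, 2)},
        {(0, 1), (1, 2)}]
for edge, frac in sorted(edge_fractions(msts).items()):
    print(edge, round(frac, 2), line_style(frac))
```

These fractions only describe how often an edge appears among equally parsimonious trees; they say nothing about how well the data support it.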

Results

Estimating credibility of MSTs

Within any set of alternative MSTs examined, the individual trees demonstrated a considerable range of average bootstrap values.

Although all MSTs in the MSN are equally parsimonious, some tree configurations are more statistically robust.

Results

Estimating credibility of MSTs

• By restricting analysis to a single, arbitrary MST, there is considerable risk in picking a tree with an inferior credibility.

• By surveying and evaluating trees within the MSN, it is possible to identify those with more credible configurations.

Results

Systematic approach to MST estimation

Discussion

• Failing to consider alternative solutions (MSTs) can easily mislead or confound our understanding of population structure.

• Molecular epidemiology has yet to adopt measures to evaluate the credibility of the estimation.

• Presenting a single MST neither explores the range of alternative hypotheses nor evaluates the quality of MSTs based on their relative credibilities.

Discussion ~ proposed approach to MST analysis ~

• 1. The distance matrix that maximizes the differences between individuals is calculated. For VNTR data, a distance matrix calculated by the difference method should be used, and for MLST data, distances should be computed from the underlying DNA sequence data.

• 2. Instead of returning a single, arbitrarily selected MST, the MSN (representing or approximating the entire population of alternative MSTs) is reported. The total number of possible MSTs is estimated using a mark-recapture calculation.

Discussion ~ proposed approach to MST analysis ~

• 3. A bootstrapping metric is employed to estimate the credibility of individual MSTs within the population of alternative solutions comprising the MSN. As many MSTs as time permits are subjected to bootstrap analysis so that the most reliable MST topology can be estimated and statistical support for particular relationships may be ascertained.

• 4. The most credible hypothesis or hypotheses within the larger population of MSTs are reported.

Thanks for Your Attention !
