Inadequacies of Minimum Spanning Trees in Molecular Epidemiology

Stephen J. Salipante and Barry G. Hall

JOURNAL OF CLINICAL MICROBIOLOGY, Oct. 2011, p. 3568–3575

Speaker: Bin-Shenq Ho

Dec. 19, 2011

Underlying Reasoning

• How representative is a single, arbitrarily selected MST when many equally optimal solutions may exist?

• What role can statistical metrics play in assessing the credibility of MST estimates?

Materials and Methods

MSTgold

http://www.bellinghamresearchinstitute.com

http://web.me.com/barryghall/

Search limits:

• Maximum amount of time

• Maximum number of unique MSTs

• Minimum rate of new discovery

Materials and Methods

Distance matrix calculation

• Equidistant method

sequence, spoligotype, SNP

• Difference method

VNTR

spoligotype: spacer oligonucleotide type

(http://www.cdc.gov/tb/programs/genotyping/Chap3/3_CDCLab_2Description.htm)

VNTR: variable number of tandem repeats

(http://www.cdc.gov/tb/programs/genotyping/Chap3/3_CDCLab_2Description.htm)

MLST: multilocus sequence typing

• The procedure characterizes isolates of bacterial species using the DNA sequences of internal fragments of multiple housekeeping genes.

• For each housekeeping gene, the different sequences present within a bacterial species are assigned as distinct alleles and, for each isolate, the alleles at each of the loci define the allelic profile or sequence type (ST).

• Nucleotide differences between strains can be checked at a variable number of genes depending on the degree of discrimination desired.

(http://en.wikipedia.org/wiki/Multilocus_sequence_typing)

Materials and Methods

MST estimation and MSN creation

• Kruskal’s algorithm, run repeatedly with randomized node input order

• The combination of all edges found in the unique MSTs constitutes the minimum spanning network (MSN).
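The two bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the MSTgold implementation: the function names (`kruskal`, `sample_msts`, `msn_edges`) and the toy distance matrix are hypothetical, and the way node order breaks ties (edges are listed in node input order, then stably sorted by length) is an assumption about how input-order sensitivity arises.

```python
import random


def kruskal(dist, order):
    """Kruskal's algorithm on a complete graph; ties follow the node input order."""
    parent = {v: v for v in order}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Edges are listed in node input order; the stable sort keeps that
    # order among edges of equal length, so the input order decides ties.
    edges = [(order[i], order[j])
             for i in range(len(order))
             for j in range(i + 1, len(order))]
    edges.sort(key=lambda e: dist[e[0]][e[1]])

    tree = set()
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.add((min(u, v), max(u, v)))
    return frozenset(tree)


def sample_msts(dist, cycles, seed=0):
    """Collect the unique MSTs found over repeated node order randomizations."""
    rng = random.Random(seed)
    nodes = list(range(len(dist)))
    found = set()
    for _ in range(cycles):
        rng.shuffle(nodes)
        found.add(kruskal(dist, nodes))
    return found


def msn_edges(msts):
    """The MSN is the union of all edges seen in the unique MSTs."""
    return set().union(*msts)


# Hypothetical toy data: three isolates, all pairwise distances tied.
dist = [[0, 1, 1],
        [1, 0, 1],
        [1, 1, 0]]
msts = sample_msts(dist, cycles=200)
network = msn_edges(msts)
```

With every distance tied, each of the three possible two-edge trees is equally short, so the sampled set holds several MSTs and the MSN contains all three edges.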

Materials and Methods

Estimating the number of possible MSTs

by mark-recapture (Schnabel method)

N = [(M + 1)(C + 1)] ÷ (R + 1) - 1

N + 1 = [(M + 1)(C + 1)] ÷ (R + 1)

(M + 1) ÷ (N + 1) = (R + 1) ÷ (C + 1)

N : Estimated total number of MSTs

M : Mark

C : Current

R : Recapture
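A short worked example may make the formula concrete. The function name and the counts are hypothetical, and the reading of the symbols is an assumption drawn from standard mark-recapture usage: M is the number of unique MSTs already found ("marked"), C the number found in the current sample, and R the number in the current sample that had been found before.

```python
def estimate_total_msts(marked: int, current: int, recaptured: int) -> float:
    """N = (M + 1)(C + 1) / (R + 1) - 1, as written on the slide."""
    return (marked + 1) * (current + 1) / (recaptured + 1) - 1


# Hypothetical counts: 99 MSTs already known, 49 found in the current
# sample, 9 of which were already known.
estimate = estimate_total_msts(marked=99, current=49, recaptured=9)
print(estimate)  # 499.0
```

The fewer recaptures in the current sample, the larger the estimate, which matches the intuition that frequent rediscovery means most MSTs have already been found.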

Materials and Methods

Bootstrapping

• Establishes the confidence level of a model

• 100 individual pseudoreplicates for each MST

• Bootstrap value: the fraction of pseudoreplicates yielding the same inference as the original data

• Given enough information, the data should be sufficiently redundant that independent pseudoreplicates yield analyses identical to that of the complete data set.
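The bootstrap procedure described above might look roughly like the following Python sketch. Several choices here are assumptions rather than the authors' method: loci are resampled with replacement, distances use a simple mismatch count, and an edge is counted as recovered when it can belong to at least one MST of the pseudoreplicate (that is, its endpoints are not connected by strictly shorter edges). All names and the example profiles are hypothetical.

```python
import random


def mismatch_distance(a, b):
    """Number of loci at which two profiles differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def in_some_mst(dist, u, v):
    """True if edge (u, v) belongs to at least one MST of the complete graph,
    i.e. u and v are not joined by a path of strictly shorter edges."""
    w = dist[u][v]
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in range(len(dist)):
            if y not in seen and dist[x][y] < w:
                if y == v:
                    return False
                seen.add(y)
                stack.append(y)
    return True


def bootstrap_support(profiles, tree, replicates=100, seed=0):
    """Fraction of pseudoreplicates in which each edge of `tree` is recovered."""
    rng = random.Random(seed)
    n_loci = len(profiles[0])
    hits = {edge: 0 for edge in tree}
    for _ in range(replicates):
        # Resample loci (columns) with replacement to build a pseudoreplicate.
        cols = [rng.randrange(n_loci) for _ in range(n_loci)]
        resampled = [[p[c] for c in cols] for p in profiles]
        dist = [[mismatch_distance(a, b) for b in resampled] for a in resampled]
        for u, v in tree:
            if in_some_mst(dist, u, v):
                hits[(u, v)] += 1
    return {edge: count / replicates for edge, count in hits.items()}


# Hypothetical allelic profiles (4 isolates, 6 loci) and one candidate MST.
profiles = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 2],
    [1, 1, 1, 2, 2, 2],
    [2, 2, 1, 2, 2, 2],
]
tree = [(0, 1), (1, 2), (2, 3)]
support = bootstrap_support(profiles, tree)
mean_support = sum(support.values()) / len(support)
```

Averaging the per-edge values gives one score per tree, which is one way to compare alternative MSTs of equal length.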

Bootstrap

Efron and Gong (1983)

Diaconis and Efron (1983)

Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791

Inferring the variability in an unknown distribution from which your data were drawn by resampling from the data

Results

Estimating alternative MSTs

• Multiple, equally parsimonious solutions are possible

• Kruskal’s MST algorithm is sensitive to node input order

• The Schnabel method is appropriate for estimating the number of alternative MSTs, especially after discarding the early cycles of node order randomization

The number of possible MSTs is proportional only to the number of minimal pairwise distances with equal lengths.

There is a relationship between the number of possible MSTs and the method used to compute the pairwise distance matrix.

note for distance matrix computation

• Equidistant method – sites scored merely as “same” or “different” such that any difference carries the same weight

• Difference method – distances between sites calculated on the basis of the difference between the values of the two sites
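The contrast between the two methods is easy to see on a small example. The sketch below is illustrative only: the function names and VNTR repeat counts are hypothetical, and summing absolute differences across loci is one plausible reading of the difference method, not a formula confirmed by the slides.

```python
def equidistant(a, b):
    """Each locus scores 1 if the values differ and 0 if they match."""
    return sum(1 for x, y in zip(a, b) if x != y)


def difference(a, b):
    """Each locus contributes the absolute difference between its two values."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(profiles, metric):
    """Pairwise distances between all profiles under the chosen metric."""
    return [[metric(p, q) for q in profiles] for p in profiles]


# Hypothetical VNTR profiles: repeat counts at four loci for three isolates.
profiles = [
    [3, 5, 2, 7],
    [3, 6, 2, 7],
    [3, 9, 2, 7],
]
equi = distance_matrix(profiles, equidistant)
diff = distance_matrix(profiles, difference)
```

Here the equidistant matrix gives all three pairs a distance of 1, so every spanning tree is equally short, while the difference matrix gives 1, 4, and 3, leaving a single shortest tree.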

There were significantly fewer alternative MSTs possible when the same data were processed using the difference method.

There is a relationship between the type of data used and the number of possible alternative MSTs.

Results

Estimating alternative MSTs

• When there are limited numbers of informative sites and alleles are treated as equidistant from one another, there are many pairwise distances of the same length, and large numbers of MSTs are possible.

• Basing analyses on the arithmetic number of pairwise differences among individuals both limits the number of possible MSTs and more faithfully represents the genetic distances between individuals.

Results

Creating MSN

• Approximation by majority rule

dashed line – edges present in ≥ 50% of MSTs

solid line – edges present in 100% of MSTs

• Fraction ≠ Credibility
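The majority-rule drawing convention can be expressed as a small edge-frequency count. This is a sketch under assumptions: the names are hypothetical, each MST is represented as a set of node pairs, and edges found in fewer than 50% of MSTs are assumed to be left out of the drawing, which the slide does not state explicitly.

```python
from collections import Counter


def edge_fractions(msts):
    """Fraction of MSTs in which each edge appears."""
    counts = Counter(edge for tree in msts for edge in tree)
    return {edge: n / len(msts) for edge, n in counts.items()}


def line_style(fraction):
    """Map an edge's fraction to the drawing convention on the slide."""
    if fraction == 1.0:
        return "solid"
    if fraction >= 0.5:
        return "dashed"
    return "omitted"  # assumption: minority edges are not drawn


# Hypothetical set of four alternative MSTs over three isolates.
msts = [
    {(0, 1), (0, 2)},
    {(0, 1), (0, 2)},
    {(0, 1), (0, 2)},
    {(0, 1), (1, 2)},
]
fractions = edge_fractions(msts)
styles = {edge: line_style(f) for edge, f in fractions.items()}
```

These fractions only report how often an edge occurs among the sampled MSTs; as the last bullet notes, they say nothing about statistical support.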

Results

Estimating credibility of MSTs

Within any set of alternative MSTs examined, the individual trees demonstrated a considerable range of average bootstrap values.

Although all MSTs in the MSN are equally parsimonious, some tree configurations are more statistically robust.

Results

Estimating credibility of MSTs

• By restricting analysis to a single, arbitrary MST, there is considerable risk in picking a tree with an inferior credibility.

• By surveying and evaluating trees within the MSN, it is possible to identify those with more credible configurations.

Results

Systematic approach to MST estimation

Discussion

• Failing to consider alternative solutions (MSTs) can easily mislead or confound our understanding of population structure.

• Molecular epidemiology has yet to adopt measures to evaluate the credibility of the estimation.

• Presenting a single MST neither explores the range of alternative hypotheses nor evaluates the quality of MSTs based on their relative credibilities.

Discussion ~ proposed approach to MST analysis ~

• 1. The distance matrix that maximizes the differences between individuals is calculated. For VNTR data, a distance matrix calculated by the difference method should be used, and for MLST data, distances should be computed from the underlying DNA sequence data.

• 2. Instead of returning a single, arbitrarily selected MST, the MSN (representing or approximating the entire population of alternative MSTs) is reported. The total number of possible MSTs is estimated using a mark-recapture calculation.

Discussion ~ proposed approach to MST analysis ~

• 3. A bootstrapping metric is employed to estimate the credibility of individual MSTs within the population of alternative solutions comprising the MSN. As many MSTs as time permits are subjected to bootstrap analysis so that the most reliable MST topology can be estimated and statistical support for particular relationships may be ascertained.

• 4. The most credible hypothesis or hypotheses within the larger population of MSTs are reported.

Thanks for Your Attention!