ARTICLE IN PRESS
Available at www.sciencedirect.com
WATER RESEARCH 41 (2007) 3575–3584
journal homepage: www.elsevier.com/locate/watres
0043-1354/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.watres.2007.05.026
A comparison of ARA and DNA data for microbial source tracking based on source-classification models developed using classification trees
Bertram Price a,*, Elichia Venso b,c, Mark Frana b, Joshua Greenberg a, Adam Ware a
a Price Associates, Inc., One N. Broadway Ste 406, White Plains, NY 10601, USA
b Department of Biological Sciences, Salisbury University, Salisbury, MD 21801, USA
c Environmental Health Sciences, Salisbury University, Salisbury, MD 21801, USA
Article info
Article history:
Received 16 January 2007
Received in revised form 8 May 2007
Accepted 10 May 2007
Available online 24 May 2007
Keywords:
MST
Classification trees
DNA
ARA
Library size
Reliability
Enterococcus spp.
* Corresponding author. Tel.: +1 914 686 7975; fax: +1 914 686 7977.
Abstract
The literature on microbial source tracking (MST) suggests that DNA analysis of fecal
samples leads to more reliable determinations of bacterial sources of surface water
contamination than antibiotic resistance analysis (ARA). Our goal is to determine whether
the increased reliability, if any, in library-based MST developed with DNA data is sufficient
to justify its higher cost, where the bacteria source predictions are used in TMDL surface
water management programs. We describe an application of classification trees for MST
applied to ARA and DNA data from samples collected in the Potomac River Watershed in
Maryland. Conclusions concerning the comparison of ARA and DNA data, although
preliminary at the current time, suggest that the added cost of obtaining DNA data in
comparison to the cost of ARA data may not be justified, where MST is applied in TMDL
surface water management programs.
© 2007 Elsevier Ltd. All rights reserved.
1. Introduction
Microbial contamination of surface waters resulting from
fecal materials in a watershed is a major environmental
health concern. Microbial source tracking (MST), the initial
investigation for reducing surface water contamination,
includes identifying the sources of fecal material in a
watershed and determining how much each source contri-
butes to the contamination. Based on MST information, fecal
bacteria sources may be targeted for reduction, which in turn
would be expected to reduce microbial surface water loading
to acceptable levels, such as those consistent with EPA’s total
maximum daily load (TMDL) guidelines (US EPA, 1986).¹

¹ http://www.epa.gov/owow/tmdl/intro.html; Maryland Department of the Environment (2005).

Briefly, fecal (scat) samples are collected from known
potential sources of surface water bacterial contamination in
a specific watershed of interest. The bacteria from these
sources are isolated, and may be analyzed by a number of
phenotypic and genotypic methods (Stoeckel and Harwood,
2007). The results of these analyses constitute a reference
library of source-specific bacterial isolates, which is used to
develop a source-classification model. The model is then
applied to water samples from the targeted surface water body
in order to estimate relative contributions of bacterial contam-
ination into the surface water by each source. For application in
a TMDL program, the relative sources are then quantified in
terms of loads. Using the source loads, bacteria-loading
reduction goals are set for each source in order to reduce the
concentration of bacterial contaminant in the targeted surface
water to within regulatory levels.
The two predominant types of data that have been used for
MST are based on antibiotic resistance analysis (ARA) and
DNA fingerprinting (Graves et al., 2002; Hagedorn et al., 1999;
Harwood et al., 2000, 2003; Hassan et al., 2005; Parveen et al.,
1997; Simpson et al., 2002; US EPA, 2005; Whitlock et al., 2002;
Wiggins et al., 1999). Our interest is in comparing the
reliability of these two types of data for library-based MST.
While reliability for an application of MST is based on the
probability of correct classification of surface water samples
containing bacteria among a set of potential sources, a
meaningful evaluation of reliability requires the context of a
specific application. The context we employ is the application
of MST in TMDL surface water management programs. Our
objective is to determine whether increased reliability, if any,
of library-based MST developed with DNA data is sufficient to
justify its higher cost, where source predictions are used in
TMDL surface water management programs.
The probability of correct source classification depends on
various factors including (1) the index bacteria organism
employed; (2) the number of sources; (3) the size of the library
used to develop the MST model (i.e., the number of samples
collected and the corresponding number of isolates prepared
and analyzed); (4) the degree to which library samples
represent the sources in the watershed; (5) the measurement
method used to characterize the isolates; and (6) the
statistical model used for bacterial source classification
(Johnson et al., 2004; Lasalde et al., 2005). To focus attention
on the comparison of ARA data versus pulsed-field gel
electrophoresis (PFGE) DNA data, we held a few of these
factors constant: the index organism was Enterococcus spp.;
there were four bacterial sources (human, pet, livestock, and
wildlife); and the source classification model was developed
using classification trees. This analysis of ARA versus DNA
data for MST provides a partial answer for comparing the
rates of correct classification for each type of data. The results
suggest that while source classification using DNA data has a
higher correct classification rate than ARA data, this advan-
tage does not necessarily translate into different strategies for
achieving TMDL objectives, and may therefore not be
sufficient to justify the higher cost of obtaining DNA data.
2. Materials and methods
2.1. Sample collection and library representation of the watershed
Scat samples should be collected in a manner that provides
coverage for the entire watershed. Stratification and propor-
tional sampling by strata could be used if information were
available indicating that certain areas within the watershed
were more likely as sources of fecal bacteria than other areas.
Also, if information were available on the relative population
sizes of the contributing sources (i.e., pet, human, livestock,
and wildlife for the watershed), the source categories could be
used to define strata and a proportional sampling scheme for
these strata could be designed. However, in any initial MST
investigation, it is unlikely that the information needed to
define strata would be well developed. Therefore, a sampling
design based on a combination of expert scientific judgment
and convenience is a practical choice. In our example, scat
collection focused on the most likely sources of fecal
contamination for the Potomac River and included 283 scat
samples from humans (sewage), and a variety of different
livestock, pets, and wildlife.
2.2. Isolation and identification of bacteria
Enterococcus spp. was the indicator organism used for this
MST study (Wheeler et al., 2002). Up to eight enterococcal
isolates were obtained from each of the 283 scat samples
collected in the Potomac River Watershed from the afore-
mentioned sources. Bacterial isolates that produced reddish
colonies when grown on m-Enterococcus Agar (Difco™)
and turned Enterococcosel™ broth (BBL™) black (esculin
positive) were presumptive Enterococcus genus and were used
for ARA.
Prior to DNA fingerprinting, the same enterococci isolates
used for ARA were further identified to species using carbon
source metabolic fingerprint analysis by BiOLOG MicroLog™
System (Hayward, CA). Only those isolates specifically identi-
fied as Enterococcus casseliflavus, Enterococcus faecalis, and
Enterococcus faecium were analyzed for DNA fingerprints using
PFGE, as these were the most abundant species. To ensure fair
comparison of ARA and DNA data, only isolates identified to
those three bacterial species were used for both types of
analysis (n = 234).
2.3. Antibiotic resistance analysis
Enterococcus spp. isolates were inoculated into Enterococcosel
broth contained in each of 48 wells of a 96-well plate. Plates
were then incubated overnight at 37 °C. By using a replicate
plater, isolates were transferred from the 48 wells onto 31
agar plates, each containing a different antibiotic-concentration
combination. Agar plates were incubated at 37 °C for 24 h
and then individual isolates were scored for growth or no
growth. Isolates that grew on a particular plate were
considered resistant to the antibiotic at that concentration,
and isolates that did not grow were considered susceptible.
2.4. DNA fingerprint analysis by PFGE
DNA was extracted from the bacterial isolates and embedded
into agarose plugs as previously described (Murray et al.,
1990). Bacterial DNA was digested using the restriction
enzyme Sma I. The DNA fragments generated from these
isolates were separated by PFGE using the CHEF Mapper™
(BioRad Laboratories, Hercules, CA). Gels were photographed
and banding patterns were quantified using BioNumerics®
(Applied Maths, Inc., Austin, TX).
2.5. Data characteristics
ARA and DNA laboratory analyses of isolates may be recorded
either as binary or as interval data. For ARA, isolate bacteria
are treated with different fixed concentrations of antibiotics.
When the bacteria grow in the presence of a particular
antibiotic at a particular concentration, this is recorded as a 1,
signifying resistance. When the bacteria do not grow in the
presence of a particular antibiotic-concentration treatment,
this is recorded as a 0, signifying susceptibility. These data
yield an antibiotic resistance profile consisting of 1’s and 0’s
for each antibiotic at each concentration. Alternatively, the
ARA laboratory results may be condensed by recording the
highest concentration of each antibiotic of those tested for
which the bacteria are resistant, thereby yielding an antibiotic
resistance profile consisting of interval data. For example, if
bacteria are treated with 10 antibiotics at 5 different
concentrations each, then the data may be recorded as 50
binary variables or 10 interval variables. If we impose the
biologically indicated assumption that bacteria will be
resistant to all concentrations of an antibiotic less than the
highest concentration for which the bacteria are resistant,
then there is a one-to-one relationship between ARA binary
and interval data as long as no data are missing. Our dataset
was complete, and we checked it to make sure that this
assumption held true. We therefore used the interval data,
due to its lower dimensionality.
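The relationship between the two ARA encodings can be sketched in Python. This is a minimal illustration, not the authors' data-processing code: the function names and the example concentrations are hypothetical, and scores are assumed to be ordered from the lowest to the highest tested concentration of one antibiotic.

```python
from typing import List

def is_monotone(binary: List[int]) -> bool:
    """True if no growth (1) appears after the first no-growth score (0).

    `binary` holds 1/0 growth scores ordered from the lowest to the
    highest tested concentration of a single antibiotic.
    """
    seen_zero = False
    for score in binary:
        if score == 0:
            seen_zero = True
        elif seen_zero:
            return False
    return True

def to_interval(concentrations: List[float], binary: List[int]) -> float:
    """Highest tested concentration at which the isolate grew (0.0 if none)."""
    if not is_monotone(binary):
        raise ValueError("profile violates the monotone-resistance assumption")
    resistant = [c for c, score in zip(concentrations, binary) if score == 1]
    return max(resistant) if resistant else 0.0

def to_binary(concentrations: List[float], level: float) -> List[int]:
    """Rebuild the 1/0 profile implied by an interval value."""
    return [1 if c <= level else 0 for c in concentrations]

# Hypothetical concentrations for one antibiotic (illustrative values only).
concentrations = [10.0, 20.0, 40.0]
level = to_interval(concentrations, [1, 1, 0])   # isolate grew at 10 and 20
round_trip = to_binary(concentrations, level)     # recovers [1, 1, 0]
```

Under the monotone assumption the round trip loses nothing, which is why the interval form can replace the binary form when no data are missing.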
For DNA fingerprint analysis, photographs of the electro-
phoresis gel runs capture arrays of bands corresponding to
different DNA fragment lengths. The intensities of the bands
relate to the density of these fragments. The results may be
recorded either as interval data, the grayscale intensity of the
bands at different fragment lengths, or as binary data, the
presence or absence of the bands at different fragment
lengths. Given that binary and interval data for DNA
fingerprint analysis have the same dimension, interval data,
due to their higher information content, are preferable for
statistical analysis.
2.6. Statistical method—classification trees
2.6.1. Algorithm
The classification tree method builds classification models by
recursively splitting into two the reference library of isolates
such that the resulting subsets, termed nodes, are increasingly
homogeneous with respect to isolate sources (Breiman
et al., 1998; Hastie et al., 2001; Price et al., 2006).² The first split
divides the reference library of isolates into two nodes by
considering every binary split defined by the outcome
associated with the range of values for every variable. A
variable is selected for the split that maximizes homogeneity
of isolate sources within each of the two resultant nodes. The
same procedure is applied to each of the resultant nodes, and
a tree of recursively partitioned data is grown until a stopping
rule is invoked or until each node is homogeneous by source
category and cannot be split any further. The collection of
nodes at the ends of the branches of the tree, called terminal
nodes, constitutes a classification model. That is, each
terminal node is characterized by a unique profile of variables
and their range of values, and is associated with one source
category, the most prevalent source category represented by
isolates in the node. Consequently, any isolate, whether from
the library or a water sample of unknown source, that has the
same variable profile as that specified by a particular terminal
node would be assigned the source associated with that
terminal node.

² We used Classification and Regression Trees (CART®) software developed by Salford Systems to build classification trees.

Various approaches for determining when to conclude the
node-splitting process have been suggested and analyzed
(Breiman et al., 1998). Stopping criteria, or rules for determin-
ing when to stop splitting nodes, are typically based on a
measure of source homogeneity for isolates within a node.
Source homogeneity achieves its maximum when all isolates
in a node have the same source. A node where an additional
split would not yield a specified minimum increase in
homogeneity would be a terminal node.
Hastie et al. (2001) describe three criteria or indexes that
address homogeneity. As formulated, the indexes measure
the complement of homogeneity, which Hastie et al. refer to
as impurity. Homogeneity is maximized when impurity is
minimized. The three impurity indexes are Misclassification
error, Gini index, and Cross-entropy deviance. The optimal
split of a node at any stage of model development is the one
that minimizes impurity as measured by the chosen index.
We relied on the Gini index for developing the classification
tree models in our examples.
The Gini index is calculated for a node labeled m as

$$\sum_{k} p_{mk}\,(1 - p_{mk}),$$

where $p_{mk}$ is the proportion of isolates from source k in node m.
The minimum value taken by the Gini index is 0, which
occurs when all isolates in a node are from the same source
(i.e., minimum impurity corresponds to maximum homo-
geneity). The maximum of the Gini index occurs when
the isolates in a node are equally distributed among the
sources.
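The Gini calculation and its use in choosing a split can be sketched in a few lines of Python. This is an illustrative sketch, not the CART software's implementation: the function names are hypothetical, and weighting each child node's impurity by its share of isolates is the usual CART convention, assumed here because the text does not state it.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

def gini(sources: Sequence[str]) -> float:
    """Gini index of a node: sum over sources k of p_k * (1 - p_k)."""
    n = len(sources)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(sources).values())

def best_split(values: Sequence[float],
               sources: Sequence[str]) -> Optional[Tuple[float, float]]:
    """Search one variable for the split 'value <= threshold' with least impurity.

    Returns (threshold, weighted impurity of the two child nodes), or None
    when the variable takes a single value and cannot be split.
    """
    n = len(values)
    best = None
    # The largest value is skipped because it would leave the right node empty.
    for threshold in sorted(set(values))[:-1]:
        left = [s for v, s in zip(values, sources) if v <= threshold]
        right = [s for v, s in zip(values, sources) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

pure = gini(["wildlife"] * 4)                            # single source
even = gini(["human", "livestock", "pet", "wildlife"])   # four equal shares
```

A full tree builder would repeat `best_split` over every variable in each node and stop when a stopping rule is met; that loop is omitted here.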
At this time, we know of no general guidance concerning a
preference for one of the impurity indexes for MST modeling.
Following the general suggestion that model building is an
exploratory data analysis process, experimentation with each
of the indexes may be beneficial.
2.6.2. Source identification for terminal nodesThe classification model, equivalently the set of rules for
classifying isolates by source, is embodied in the collection of
terminal nodes. Each terminal node in a tree represents
isolates with a particular combination of outcomes for the
classification variables and assigns the source with highest
probability in the node to all such isolates.
The source probabilities in a terminal node are posterior
probabilities of the events: ‘‘source, given the pattern of
variables in the terminal node.’’ The posterior probability for a
particular source S in terminal node T is

$$P(S \mid T) = \frac{P(T \mid S)\,P(S)}{P(T)}, \tag{1}$$

where $P(S \mid T)$ is the posterior probability that an isolate
comes from source S given that the isolate is in terminal
node T; $P(T \mid S)$ is the conditional probability that an isolate
is in terminal node T given that it comes from source
S; $P(S)$ is the prior probability that an isolate comes from
source S; and $P(T)$ is the overall probability that an isolate is in
node T.
Where there are four source categories, the posterior
probability for the ith source would be calculated as

$$P(S_i \mid T) = \frac{P(T \mid S_i)\,P(S_i)}{\sum_{j} P(T \mid S_j)\,P(S_j)}. \tag{2}$$

$P(T \mid S_j)$ can be estimated from the isolates in the reference
library as $n_j/N_j$, where $n_j$ is the number of isolates from
source j in terminal node T and $N_j$ is the total number of
isolates from source j in the reference library. P(S), the prior
probability distribution of source identity, must be specified
as a modeling parameter that is not necessarily determined
by the library data. Therefore, the prior probabilities may play
a significant role in determining source predictions in the
classification tree model.
There are two immediately apparent choices for assigning
values to the prior probabilities: (1) $P(S_i)$ is proportional to
the population size of the ith source in the watershed, which
if estimated from the isolates in the reference library
would be $N_i / \sum_j N_j$, where $\{N_i,\ i = 1, 2, 3, 4\}$ are the numbers of
isolates from source i in the library; (2) equal values, 0.25 for
each of the four sources, which is referred to as a non-
informative prior used when little or no information is
available about the distribution of sources in the watershed.
If the source proportions from the library were used as
prior probabilities in Eq. (2), the posterior source proba-
bilities would be the proportions of the sources for isolates in
that node. Then, the node would be represented by the
majority source among the isolates in the node. Subse-
quently, any isolate of unknown source that would be
classified by the tree classification model into the terminal
node under discussion would be assigned the majority source
for that node.
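The reduction to node proportions follows directly from Eq. (2). As a short worked step, substituting the library estimates $P(T \mid S_j) = n_j/N_j$ and the library-proportion priors $P(S_j) = N_j / \sum_k N_k$ gives

$$P(S_i \mid T) = \frac{(n_i/N_i)\,\bigl(N_i/\sum_k N_k\bigr)}{\sum_j (n_j/N_j)\,\bigl(N_j/\sum_k N_k\bigr)} = \frac{n_i}{\sum_j n_j},$$

because each $N_j$ cancels within its own term and the common factor $1/\sum_k N_k$ cancels between numerator and denominator. The index $k$ is introduced here only to keep the library total distinct from the sum over sources.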
However, using the source proportions among the isolates
in the reference library as prior probabilities is not advisable
in most circumstances. It is unlikely that the proportions of
isolates in the reference library from a particular source
would reflect the true proportions of potential sources of
bacteria in the watershed. Even if the collection of samples
provides adequate coverage of the sources in the watershed,
the numbers of isolates for each source are unlikely to be
proportional to actual source distribution.
The second obvious choice for prior probabilities, equal
probabilities of 0.25 for each of the four sources in the
Potomac analysis, is a more acceptable choice. Absent specific
reliable data on the source contributions of bacteria from the
watershed, the equal priors properly reflect ignorance con-
cerning the source contributions and, in that respect, lead to
an unbiased classification model. Under the equal prior
probability choice, the posterior probabilities would be
calculated using Eq. (2), with each $P(S_j)$ replaced by 0.25. The
posterior probability for the ith source would be

$$P(S_i \mid T) = \frac{n_i/N_i}{\sum_{j} n_j/N_j}. \tag{3}$$

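A short Python sketch shows how the choice of prior changes the source assigned to a node. The function name is hypothetical. The library totals are those reported for the Potomac library (54 human, 32 livestock, 18 pet, 130 wildlife isolates), and the node counts are read from the 95-isolate node in Fig. 2 (33 human, 8 livestock, 0 pet, 54 wildlife), so they should be treated as an illustration rather than a re-analysis.

```python
from typing import Dict, Optional

def posterior(node_counts: Dict[str, int],
              library_counts: Dict[str, int],
              priors: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Posterior source probabilities for one terminal node.

    P(T|S_j) is estimated as n_j / N_j. With priors=None the sources get
    equal prior probabilities, so the common prior cancels.
    """
    sources = list(library_counts)
    if priors is None:
        priors = {s: 1.0 / len(sources) for s in sources}
    weights = {s: (node_counts.get(s, 0) / library_counts[s]) * priors[s]
               for s in sources}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("node contains no isolates")
    return {s: w / total for s, w in weights.items()}

library = {"human": 54, "livestock": 32, "pet": 18, "wildlife": 130}
node = {"human": 33, "livestock": 8, "pet": 0, "wildlife": 54}

equal_prior = posterior(node, library)
assigned_equal = max(equal_prior, key=equal_prior.get)

# Library proportions as priors reproduce the raw node proportions.
n_library = sum(library.values())
library_prior = posterior(node, library,
                          {s: n / n_library for s, n in library.items()})
assigned_library = max(library_prior, key=library_prior.get)
```

With equal priors the node is assigned to human (posterior of about 0.48) even though wildlife isolates are the majority in the node; with library proportions as priors the same node would be assigned to wildlife.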
2.6.3. Model stability
A classification model is stable if it consistently classifies new
isolates from known sources accurately. Stability depends on
the size of the reference library, how well the isolates in the
library represent the watershed, the number of terminal
nodes in the classification model, and the number of isolates
in each terminal node. A model with a large number of
terminal nodes will classify the isolates in the library with
great accuracy, but if the number of isolates in each terminal
node is small, the model may not perform well on samples
that were not in the library. For classification tree models in
general, model stability involves a tradeoff among library size,
number of terminal nodes, and number of isolates per
terminal node. Stability may be advanced by imposing size
limitations on the terminal nodes. In our example, we limited
terminal node size indirectly by requiring that a node contain
at least 10 isolates in order to be considered for further
splitting. Limiting the number of isolates in a terminal node is
equivalent to imposing a statistical precision requirement on
the estimates of posterior probabilities for that node. This
requirement can be used to determine appropriate data
reference library size, an important issue addressed in
Section 2.6.4.
2.6.4. Reference library size
The size of the reference library is determined by specifying
the number of field samples to be collected, the number of
isolates to be grown from each sample, and the number of
isolates to be analyzed. Determining the library size may
follow the usual approach for setting requirements for the
number of observations in a statistical sample: (1) state a
well-defined objective for the statistical analysis; (2) select a
criterion that measures the performance of the statistical
analysis method; and (3) select a large probability for assuring
that the criterion in (2) is met. This general approach,
although conceptually straightforward for typical statistical
objectives such as estimating parameters or testing hypoth-
eses, can be complicated in application and has not been
applied in the MST literature.
However, for source classification using the classification
tree model, library size may be determined by focusing on the
posterior probabilities in terminal nodes. Each terminal node
is associated with one source, the source corresponding to the
largest posterior probability among sources of the isolates in
the node. This association is determined by a statistical
estimate of posterior probabilities for the node. Therefore, the
statistical objective for determining the size of the library is to
assure with a high probability that the source corresponding
to the largest estimated posterior probability in a terminal
node is, in fact, the source for the true largest posterior
probability among all sources. One additional requirement
must be imposed for determining library size. Correct
selection of the source with the largest posterior probability
is assured only if the largest true posterior probability is larger
than the second largest true posterior probability by a
specified amount. If a small difference were specified, a larger
library would be required than if a larger difference were
specified. In addition, the probability of correct selection
(P(CS)) should be relatively large, at least 0.90.
This formulation for determining library size for MST is an
adaptation of the ranking and selection problem in statistics
(Gibbons et al., 1977). The ranking and selection objective is to
determine the number of samples required to select the
‘‘best’’ population from among a number of candidate
populations. For MST modeling, the populations are the
sources and ‘‘best’’ means largest posterior probability of
source for a terminal node. The number of samples must be
large enough to satisfy the two requirements mentioned
above: (1) correct selection should be assured when the
largest value of the posterior probability exceeds the second
largest by a specified amount; and (2) the probability of
making the correct selection subject to this requirement is at
least a pre-specified probability (e.g., 0.90, 0.95, or 0.99).
In each terminal node, we use the posterior probability to
identify one source among the four sources as the represen-
tative of that node. Where the prior probabilities are constant,
which we adopt in our modeling approach, the estimates of
the four posterior probabilities in a particular terminal node
are given by Eq. (3). To address the number of samples issue,
we adopt one simplification. We proceed by assuming that
the numbers of isolates in the library from each source are
the same (i.e., $N_i = N$ for all i). With this simplification, the
estimates of the posterior probabilities become $\{n_i / \sum_j n_j\}$,
the ratios of the numbers of isolates for each source in the
terminal node divided by the total number of isolates in the
terminal node. Additionally, the data in the terminal node
have a multinomial distribution (Ross, 2005).
We apply sample size tables developed for selecting the
category of a multinomial distribution with the largest
probability (Gibbons et al., 1977) to determine the number of
isolates in each terminal node. As an example, the number of
isolates required in a terminal node to assure correct
selection, with probability equal to 0.90, of the source
category with the largest posterior probability when the
largest posterior probability exceeds the second largest
posterior probability by a factor of 2.0 is 43 (Table 1).
To determine the total number of samples and isolates for
the library, we need to specify the number of isolates per
sample and the number of terminal nodes we allow in the
classification model. For example, consider a design with 10
isolates per sample and a classification model with 20
terminal nodes. With 20 terminal nodes, 43 isolates per
terminal node translate into 860 total isolates in the library, or
86 samples with 10 isolates each. The number of samples for
each source would be approximately 22. If P(CS) were required
to be 0.99, then the number of isolates per terminal node
would be 104 (Table 1). For 20 terminal nodes and 10 isolates
per sample, the library must consist of approximately 208
samples and 2080 isolates. Our data from the Potomac River
Watershed are still being developed, and currently contain
fewer isolates than recommended by this method of
determining appropriate data library size.

Table 1 – Total number of isolates per terminal node to assure a specified probability of selecting the source with the largest probability

Ratio of largest probability      Probability of correct selection, P(CS)
to second largest probability     0.90      0.95      0.99
1.2                               692       979       1660
1.4                               196       278       471
1.6                               98        139       235
1.8                               61        87        147
2.0                               43        61        104
2.2                               33        46        79
2.4                               26        37        63
2.6                               22        31        53
2.8                               19        26        45
3.0                               16        23        39
Source: Gibbons et al. (1977).

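The library-size arithmetic can be written as a small Python lookup. This is a sketch under the same simplification the text uses, namely that every terminal node receives the required number of isolates; the function and variable names are hypothetical, and the table values are copied from Table 1.

```python
import math
from typing import Tuple

# Isolates required per terminal node, keyed by the ratio of the largest to
# the second largest posterior probability, then by P(CS). Values are those
# of Table 1 (Gibbons et al., 1977).
ISOLATES_PER_NODE = {
    1.2: {0.90: 692, 0.95: 979, 0.99: 1660},
    1.4: {0.90: 196, 0.95: 278, 0.99: 471},
    1.6: {0.90: 98, 0.95: 139, 0.99: 235},
    1.8: {0.90: 61, 0.95: 87, 0.99: 147},
    2.0: {0.90: 43, 0.95: 61, 0.99: 104},
    2.2: {0.90: 33, 0.95: 46, 0.99: 79},
    2.4: {0.90: 26, 0.95: 37, 0.99: 63},
    2.6: {0.90: 22, 0.95: 31, 0.99: 53},
    2.8: {0.90: 19, 0.95: 26, 0.99: 45},
    3.0: {0.90: 16, 0.95: 23, 0.99: 39},
}

def library_size(ratio: float, p_cs: float,
                 terminal_nodes: int, isolates_per_sample: int) -> Tuple[int, int]:
    """Return (total isolates, number of samples) for the chosen design."""
    per_node = ISOLATES_PER_NODE[ratio][p_cs]
    total_isolates = per_node * terminal_nodes
    # Round up so a partial sample still counts as one sample to collect.
    samples = math.ceil(total_isolates / isolates_per_sample)
    return total_isolates, samples

design_90 = library_size(2.0, 0.90, terminal_nodes=20, isolates_per_sample=10)
design_99 = library_size(2.0, 0.99, terminal_nodes=20, isolates_per_sample=10)
```

For the design discussed in the text, `design_90` is 860 isolates in 86 samples and `design_99` is 2080 isolates in 208 samples.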
2.6.5. Classification probability thresholds
Where the number of isolates in the library is determined by
the method described above, the source associated with each
terminal node will be, with a high probability, correctly
identified as the source with the largest posterior probability
among the sources with isolates in that node. In the above
example, we chose a design that assures correct selection of
the source provided the largest posterior probability is at least
twice the second largest posterior probability. However, even
though the terminal node has been correctly assigned a
source, the probability of source membership for isolates in
that node may not be large enough for source classification
with a high level of confidence. Consider two examples. In the
first example, the estimate of the largest posterior probability
is 0.80. In the second example, the estimate of the largest
posterior probability is 0.50. Whereas we may have comfort in
assigning an isolate to a source where the probability of
belonging to that source is 0.80, we would be uncertain about
the correctness of the assignment if the probability were 0.50.
This uncertainty can be reduced by establishing a threshold
for the maximum probability in a terminal node before
accepting the node as a basis for assigning isolates to the
source associated with the node. If the maximum posterior
probability in a terminal node does not exceed the threshold,
isolates in that terminal node would be classified as ‘‘source
unknown.’’
For example, consider a node that contains 6 pet, 6 human,
12 livestock, and 36 wildlife isolates. The posterior probabil-
ities would be 10% pet, 10% human, 20% livestock, and 60%
wildlife. Without any threshold, all isolates falling into this
node would be classified as wildlife. If the threshold were 55%,
the node and all isolates in it would also be classified as
wildlife. With a threshold of 70%, isolates falling into this
node would be classified as ‘‘source unknown.’’
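The threshold rule in this example can be sketched in Python. The function name and the "source unknown" label handling are illustrative; as in the example, the posterior probabilities are taken to be the node proportions, which corresponds to equal numbers of library isolates per source.

```python
from typing import Dict

def classify_node(counts: Dict[str, int], threshold: float = 0.0,
                  unknown: str = "source unknown") -> str:
    """Assign a node to its most probable source, or to `unknown`.

    The node is accepted only if its largest posterior probability
    exceeds `threshold`; a threshold of 0.0 means no threshold is applied.
    """
    total = sum(counts.values())
    probabilities = {source: n / total for source, n in counts.items()}
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] > threshold else unknown

node = {"pet": 6, "human": 6, "livestock": 12, "wildlife": 36}

no_threshold = classify_node(node)          # largest posterior is 0.60
at_55 = classify_node(node, threshold=0.55)
at_70 = classify_node(node, threshold=0.70)
```

Here `no_threshold` and `at_55` are both "wildlife", while `at_70` is "source unknown", matching the three cases described above.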
A threshold probability reflects our confidence in the
classification scheme. While increasing the threshold in-
creases the correct classification rate, it also increases the
number of isolates classified as unknown. The tradeoff
between correct classification rate and number of unknowns
may be investigated for each library and source classification
model by varying the probability threshold value.
3. Results
Results for the classification tree method based on resub-
stitution are presented in the figures and tables. We provide
these preliminary results as part of an ongoing comparison of
ARA and DNA data for MST.
The library consisted of 234 Enterococcus isolates with
complete data derived from 55 samples, with the source
distribution shown in Table 2.
We developed source-classification models for the ARA and
DNA libraries using the classification trees methodology. For
illustration purposes, we provide abbreviated versions of the
classification tree models for the two types of data (Figs. 1
and 2). When viewing these figures it is important to keep in
mind that we used equal prior source probabilities. Therefore,
node source distribution does not reflect the posterior
probability distribution of sources in the node.
In Fig. 1, there are four terminal nodes stemming from
three splits of the ARA data library. In our model, these nodes
were split further, eventually yielding 28 terminal nodes. We
employed a stopping rule requiring a minimum of 10 isolates
in a node before splitting.

Table 2 – Distribution of samples across sources, Potomac River Watershed

Source        Samples    Isolates
Human         14         54
Livestock     10         32
Pet           5          18
Wildlife      26         130
Total         55         234

Fig. 1 – First three splits in classification tree for Potomac River Watershed ARA data. Antibiotic concentrations are in μg/ml.

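The first three ARA splits can be written out as classification rules in Python. This sketch reflects our reading of the split variables, cut points, and node classes shown in Fig. 1; the function and argument names are hypothetical, and because the full model continues to 28 terminal nodes, this truncated tree illustrates only the form of the rules, not the final classifier.

```python
def classify_first_three_splits(chlortetracycline: float,
                                cephalothin: float,
                                streptomycin: float) -> str:
    """Source assigned by the first three splits of the ARA tree.

    Each argument is the interval-coded resistance level (in ug/ml) of one
    isolate for the named antibiotic.
    """
    if chlortetracycline > 70.0:
        return "pet"          # node of 40 isolates in Fig. 1
    if cephalothin <= 12.5:
        return "livestock"    # node of 59 isolates
    if streptomycin <= 20.0:
        return "human"        # node of 22 isolates
    return "wildlife"         # node of 113 isolates

example = classify_first_three_splits(chlortetracycline=40.0,
                                      cephalothin=25.0,
                                      streptomycin=10.0)
```

Each return statement corresponds to one terminal node of the abbreviated tree, and the conditions leading to it form that node's variable profile.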
In Fig. 2, there are four terminal nodes stemming from three
splits of the DNA data library. In our model, these nodes
were split further, eventually yielding 19 terminal nodes.
We employed the same stopping rule for DNA data as for ARA
data.
Tables 3 and 4 show the resubstitution results, comparing
the bacterial source-classification schemes for ARA and DNA
data to the actual sources. Modeling the DNA data yielded a
higher overall rate of correct reclassification (RCC), though
modeling the ARA data resulted in higher rates of correct
reclassification for pet and livestock isolates.
Applying a 70% threshold probability to the terminal
nodes in both models improved the rate of correct reclassi-
fication for both the ARA and DNA library data (Tables 5
and 6).
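The rates of correct classification can be recomputed from a table of actual versus predicted sources. The Python sketch below uses the ARA resubstitution counts reported in Table 3; the function and variable names are hypothetical.

```python
from typing import Dict, List

SOURCES = ["human", "livestock", "pet", "wildlife"]

# Rows are actual sources; columns are predicted sources in SOURCES order.
# Counts are the ARA resubstitution results reported in Table 3.
ARA_CONFUSION: Dict[str, List[int]] = {
    "human":     [37, 4, 3, 10],
    "livestock": [1, 31, 0, 0],
    "pet":       [0, 0, 18, 0],
    "wildlife":  [12, 33, 3, 82],
}

def rcc_by_source(confusion: Dict[str, List[int]]) -> Dict[str, float]:
    """Percent of each actual source's isolates that were predicted correctly."""
    return {source: 100.0 * row[SOURCES.index(source)] / sum(row)
            for source, row in confusion.items()}

def overall_rcc(confusion: Dict[str, List[int]]) -> float:
    """Percent of all isolates that were predicted correctly."""
    correct = sum(row[SOURCES.index(source)] for source, row in confusion.items())
    total = sum(sum(row) for row in confusion.values())
    return 100.0 * correct / total

per_source = rcc_by_source(ARA_CONFUSION)
overall = overall_rcc(ARA_CONFUSION)
```

The overall rate works out to 168 correct out of 234 isolates, about 71.8%, which agrees with the figure given in the table footnote.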
4. Discussion
We are currently engaged in a research project to compare the
efficiency of ARA data versus DNA data for MST within the
context of TMDL surface water bacterial contamination
TREPTOMYCIN> 20.000
N=113
12.500
CHLORTETRACYCLINE > 70.000
Class = pet
Class
human
livestock
pet
wildlife
N=40
N=234
Cases %
12 30.0
0.00
17 42.5
11 27.5
Class = wildlife
Class
human
livestock
pet
wildlife
Cases %
22 19.5
5.36
1 0.9
84 74.3
%
29.6
5.9
0.7
63.7
s
an
tock
ife
Cases %
54 23.1
13.732
18 7.7
130 55.6
Watershed ARA data. Antibiotic concentrations are in lg/ml.
ARTICLE IN PRESS
Table 3 – Resubstitution results for Potomac River Watershed ARA reference library isolates

Actual     Predicted                                                RCC^a (%)
           Human      Livestock  Pet        Wildlife   Total (%)
Human      37         4          3          10         54 (23.1)    68.5
Livestock  1          31         0          0          32 (13.7)    96.9
Pet        0          0          18         0          18 (7.7)     100.0
Wildlife   12         33         3          82         130 (55.6)   63.1
Total (%)  50 (21.4)  68 (29.1)  24 (10.3)  92 (39.3)  234

^a Rate of correct classification. Overall percent correct = 71.8%.
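As a check on how the RCC column and the overall percent correct in Table 3 follow from the counts, the short sketch below recomputes them. The variable and function names are hypothetical; the counts are those printed in Table 3.

```python
SOURCES = ["human", "livestock", "pet", "wildlife"]

# Rows are actual sources; columns are predicted sources in SOURCES order (Table 3).
TABLE_3 = {
    "human": [37, 4, 3, 10],
    "livestock": [1, 31, 0, 0],
    "pet": [0, 0, 18, 0],
    "wildlife": [12, 33, 3, 82],
}


def rcc_by_source(matrix):
    """Rate of correct classification (%) for each actual source."""
    return {
        actual: 100.0 * row[SOURCES.index(actual)] / sum(row)
        for actual, row in matrix.items()
    }


def overall_percent_correct(matrix):
    """Correctly classified isolates as a percentage of all isolates."""
    correct = sum(row[SOURCES.index(actual)] for actual, row in matrix.items())
    total = sum(sum(row) for row in matrix.values())
    return 100.0 * correct / total


for source, rate in rcc_by_source(TABLE_3).items():
    print(f"{source}: {rate:.1f}%")
print(f"overall: {overall_percent_correct(TABLE_3):.1f}%")
```

Each RCC is the diagonal count divided by the row total, and the overall figure is the sum of the diagonal (168) divided by the library size (234).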
[Fig. 2 tree diagram. Splitting rules shown: KB44_4 <= 10.000 vs. KB44_4 > 10.000; KB92_5 <= 55.100 vs. KB92_5 > 55.100; KB63_5 <= 83.300 vs. KB63_5 > 83.300. Cases (%) by source in each node:]

Node               N    Human      Livestock  Pet        Wildlife
Root               234  54 (23.1)  32 (13.7)  18 (7.7)   130 (55.6)
Intermediate       139  21 (15.1)  24 (17.3)  18 (12.9)  76 (54.7)
Intermediate       81   7 (8.6)    8 (9.9)    17 (21.0)  49 (60.5)
Class = human      95   33 (34.7)  8 (8.4)    0 (0.0)    54 (56.8)
Class = livestock  58   14 (24.1)  16 (27.6)  1 (1.7)    27 (46.6)
Class = pet        42   3 (7.1)    3 (7.1)    17 (40.5)  19 (45.2)
Class = wildlife   39   4 (10.3)   5 (12.8)   0 (0.0)    30 (76.9)
Fig. 2 – First three splits in classification tree for Potomac River Watershed DNA data. Splitting variables are DNA fragment
length measured in kilobases.
3 While our results are for resubstitution, cross-validation provides a better measure of model performance. When we repeat the analysis with a larger library we will measure performance using cross-validation. For the current model, it is likely that RCCs for cross-validation would be reduced for both types of data, but DNA RCCs would remain larger. Therefore, the resubstitution results are adequate for this preliminary analysis.
management. The results we report in this article are a
preview of the results we are developing on this topic.
Currently, the library contains fewer isolates than recom-
mended by the library size determination method for
classification trees described earlier in this article. Addition-
ally, we need to conduct similar comparison analyses on
libraries for other watersheds. Subject to these caveats, the
results presented here indicate that while the overall rates of
correct classification are higher for the DNA data than for the
ARA data (Tables 3–6), the resulting source predictions for
both data types indicate similar TMDL surface water bacterial
contamination reduction strategies.3
Table 5 – Resubstitution results for Potomac River Watershed ARA reference library isolates, with 70% threshold probability applied to terminal nodes of the classification tree model

Actual     Predicted, with 70% threshold probability                           RCC^a (%)
           Human      Livestock  Pet        Wildlife   Unknown    Total (%)
Human      34         2          3          2          13         54 (23.1)    81.1
Livestock  1          26         0          0          5          32 (13.7)    50.0
Pet        0          0          18         0          0          18 (7.7)     100.0
Wildlife   7          23         3          53         44         130 (55.6)   87.9
Total (%)  42 (17.9)  51 (21.8)  24 (10.3)  55 (23.5)  62 (26.5)  234

^a Rate of correct classification. Overall percent correct = 76.2%.
Table 6 – Resubstitution results for Potomac River Watershed DNA reference library isolates, with 70% threshold probability applied to terminal nodes of the classification tree model

Actual     Predicted, with 70% threshold probability                           RCC^a (%)
           Human      Livestock  Pet        Wildlife   Unknown    Total (%)
Human      34         7          3          1          9          54 (23.1)    87.2
Livestock  0          27         1          0          4          32 (13.7)    92.9
Pet        0          1          17         0          0          18 (7.7)     100.0
Wildlife   6          7          7          77         33         130 (55.6)   83.7
Total (%)  40 (17.1)  42 (17.9)  28 (12.0)  78 (33.3)  46 (19.7)  234

^a Rate of correct classification. Overall percent correct = 82.4%.
Table 4 – Resubstitution results for Potomac River Watershed DNA reference library isolates

Actual     Predicted                                                RCC^a (%)
           Human      Livestock  Pet        Wildlife   Total (%)
Human      39         7          3          5          54 (23.1)    72.2
Livestock  1          29         1          1          32 (13.7)    90.6
Pet        0          1          17         0          18 (7.7)     94.4
Wildlife   8          12         7          103        130 (55.6)   83.1
Total (%)  48 (20.5)  49 (20.9)  28 (12.0)  109 (46.6) 234

^a Rate of correct classification. Overall percent correct = 80.3%.
4.1. Use of bacterial species in predicting source of contamination
The rates of correct classification based on DNA data may
have the potential to be even larger if Enterococcus species are
taken into account. The library used in the current analysis
included three Enterococcus species (E. casseliflavus, E. faecalis,
and E. faecium), but we did not use species information to
develop the source-classification model. It has been sug-
gested that certain Enterococcus species are specific to certain
sources (Wheeler et al., 2002; Scott et al., 2005; Kuntz et al.,
2004; Chou et al., 2004). These differences could be exploited
for source classification, and we plan to investigate this
source–species relationship further.
4.2. Choice of model
The MST literature describes a variety of statistical analysis
methods used for developing source classification models.
Comparisons of the methods have not indicated that any one
method performs generally better than the others for all
libraries (Ritter et al., 2003; Albert et al., 2003; Harwood et al.,
2000). We have chosen to use the classification tree method
(Breiman et al., 1998) for two reasons. Firstly, the classification
tree method is an algorithmic method as opposed to a
method that requires an underlying stochastic model
(Breiman, 2001). Logistic regression and discriminant analysis
are examples of methods that require underlying stochastic
models. The most significant advantage of an algorithmic
method over a model-based method is the opportunity to
search for and evaluate every complex interaction among the
predictor variables in building the classification model. The
number of interactions among predictor variables that can be
considered in model-based methods is severely limited by the
number of data observations (isolates in the library for MST).
There is no such restriction for algorithmic methods.
Secondly, the classification tree method is a “supervised
learning” method (see, for example, Ringner et al., 2002). That
is, the source identifications of isolates in the library are
explicitly used in building the source-classification model.
Other algorithmic methods, generally referred to as clustering
methods, are “unsupervised learning” methods. These methods
search for clusters or groupings of similar data observa-
tions (isolates) without taking account of source
identification. The clusters that evolve may or may not align
well with bacteria sources.
4.3. Cost and labor
Developing DNA PFGE data for scat sample isolates is
significantly more expensive and time consuming than
developing antibiotic resistance profiles. Our laboratory fees
for both ARA and PFGE methodologies accurately reflect their
relative costs. The PFGE fee per known-source isolate is three
times that for ARA. PFGE development additionally entails
Enterococcus spp. determination, further increasing both cost
and labor relative to ARA, which does not require identification
of Enterococcus spp. Water sample analysis similarly
requires 3–4 times the funds for PFGE as for ARA for the
same number of isolates. In addition to the cost differential,
the DNA methodology requires about 10 times as much
time to implement as ARA, thereby restricting the size of
the watershed that could be analyzed within a fixed TMDL
time frame.
4.4. TMDL surface water bacterial contamination management
As mentioned previously, it is not enough to investigate the
reliability of classifying library isolates characterized by DNA
and ARA, regardless of method, without considering a context
for application. We used the application of MST in TMDL
surface water management programs as the context. The
Table 7 – Predicted source distribution of Potomac River Watershed library isolates using classification tree modeling

           Without threshold probability   With 70% threshold probability
           ARA (%)    DNA (%)              ARA (%)    DNA (%)
Human      21.4       20.5                 24.4       21.3
Livestock  29.1       20.9                 29.7       22.3
Pet        10.3       12.0                 14.0       14.9
Wildlife   39.3       46.6                 32.0       41.5

classification model would be applied to water samples from
a targeted surface water body to estimate relative source
contributions of bacteria into the surface water. For applica-
tion in a TMDL program, the relative source contributions of
bacteria are quantified in terms of loads. Using the source
loads, bacteria loading reduction goals would be set for each
source in order to reduce the concentration of bacteria in the
targeted surface water to within regulatory levels.
We are currently developing DNA and antibiotic resistance
profiles for water samples collected from the Potomac River
Watershed. For illustrative purposes, Table 7 displays the
source percentages estimated by classifying the isolates in
the library as if they had been prepared from water samples
(i.e., sources unknown). These are the types of percentages
that would be used to establish bacterial contamination
reduction goals. The results obtained with a 70% threshold
classification probability indicate very little difference be-
tween ARA and DNA for the human source and the pet
source, but a larger difference for livestock, and an even larger
difference for wildlife. However, the ranking of relative
importance of the sources based on their percentages is the
same for ARA and DNA. Although the DNA results in this
example are more accurate than the ARA results (Tables 5
and 6), the additional accuracy does not imply different
source management strategies from those indicated by
the ARA data.
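The comparison in Table 7 can be reproduced from the predicted-source totals in Tables 5 and 6. The sketch below assumes that the thresholded percentages are computed over classified isolates only, with unknowns excluded; this is inferred from the arithmetic rather than stated in the text. The names are hypothetical.

```python
# Predicted-source column totals with the 70% threshold (Tables 5 and 6).
PREDICTED = {
    "ARA": {"human": 42, "livestock": 51, "pet": 24, "wildlife": 55, "unknown": 62},
    "DNA": {"human": 40, "livestock": 42, "pet": 28, "wildlife": 78, "unknown": 46},
}


def source_distribution(counts):
    """Percentage of classified isolates per source, excluding unknowns (assumed)."""
    classified = {s: n for s, n in counts.items() if s != "unknown"}
    total = sum(classified.values())
    return {s: round(100.0 * n / total, 1) for s, n in classified.items()}


def ranking(distribution):
    """Sources ordered from largest to smallest predicted share."""
    return sorted(distribution, key=distribution.get, reverse=True)


for method, counts in PREDICTED.items():
    dist = source_distribution(counts)
    print(method, dist, ranking(dist))
```

Under this assumption the percentages match the thresholded columns of Table 7, and both data types give the same ordering of sources: wildlife, livestock, human, pet.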
5. Conclusions
Based on an analysis of a small library using the classification
tree method for developing a source-classification model, the
preliminary results show that source classification using DNA
data performs better than source classification using ARA
data. However, in a TMDL application, the practical differ-
ences in the predicted source distributions are small, and
thus indicate the same strategy for reducing bacterial
contamination in a TMDL program. In our laboratory, the
current cost per isolate of DNA PFGE analysis is four or five
times that of ARA. Identifying bacteria species in the DNA
PFGE analysis involves a process that further increases the
cost differential. Although our data and results are prelimin-
ary at this time, we conclude that questioning the value of
DNA data relative to ARA data for MST intended for
application in a TMDL program is justified, and the answer
may favor ARA data for this application.
References
Albert, J.M., Munakata-Marr, J., Tenorio, L., Siegrist, R.L., 2003. Statistical evaluation of microbial source tracking data obtained by rep-PCR DNA fingerprinting of Escherichia coli. Environ. Sci. Technol. 37, 4554–4560.
Breiman, L., 2001. Statistical modeling: two cultures. Stat. Sci. 16 (3), 199–231.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1998. Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, FL.
Chou, C., Lin, Y., Su, J., 2004. Microbial indicators for differentiation of human- and pig-sourced fecal pollution. J. Environ. Sci. Health A Toxicol. Hazard Subst. Environ. Eng. 39 (6), 1415–1421.
Gibbons, J.D., Olkin, I., Steel, M., 1977. Selecting and Ordering Populations, A New Statistical Methodology. Wiley, New York.
Graves, A.K., Hagedorn, C., Teetor, A., Mahal, M., Booth, A.M., Reneau, R.B., 2002. Antibiotic resistance profiles to determine sources of fecal contamination in a rural Virginia watershed. J. Environ. Qual. 31, 1300–1308.
Hagedorn, C., Robinson, S.L., Filtz, J.R., Grubbs, S.M., Angier, T.A., Reneau Jr., R.B., 1999. Determining sources of fecal pollution in a rural Virginia watershed with antibiotic resistance patterns in fecal streptococci. Appl. Environ. Microbiol. 65, 5522–5531.
Harwood, V.J., Whitlock, J., Withington, V., 2000. Classification of antibiotic resistance patterns of indicator bacteria by discriminant analysis: use in predicting the source of fecal contamination in subtropical waters. Appl. Environ. Microbiol. 66, 3698–3704.
Harwood, V.J., Wiggins, B., Hagedorn, C., Ellender, R.D., Gooch, J., Kern, J., Samadpour, M., Chapman, A.C., Robinson, B.J., Thompson, B.C., 2003. Phenotypic library-based microbial source tracking methods: efficacy in the California collaborative study. J. Water Health 1 (4), 153–166.
Hassan, W.M., Wang, S.Y., Ellender, R.D., 2005. Methods to increase fidelity of repetitive extragenic palindromic PCR fingerprint-based microbial source tracking efforts. Appl. Environ. Microbiol. 71, 512–518.
Hastie, T., Tibshirani, R., Friedman, J.H., 2001. The Elements of Statistical Learning. Springer, New York.
Johnson, L.K., Brown, M.B., Carruthers, E.A., Ferguson, J.A., Dombek, P.E., Sadowsky, M.J., 2004. Sample size, library composition, and genotypic diversity among natural populations of Escherichia coli from different animals influence accuracy of determining sources of fecal pollution. Appl. Environ. Microbiol. 70, 4478–4485.
Kuntz, R., Hartel, P., Rodgers, K., Segars, W., 2004. Presence of Enterococcus faecalis in broiler litter and wild bird feces for microbial source tracking. Water Res. 38, 3551–3557.
Lasalde, C., Rodriguez, R., Toranzos, G.A., 2005. Statistical analyses: possible reasons for unreliability of source tracking efforts. Appl. Environ. Microbiol. 71, 4690–4695.
Maryland Department of the Environment, 2005. Draft total maximum daily loads of fecal bacteria for the non-tidal Anacostia River Basin in Montgomery and Prince George’s Counties, Maryland. [Online.] http://www.mde.state.md.us/assets/document/Anacostia_%20fc_TMDL-08-03-2005_PN(1).pdf.
Murray, B.E., Singh, K.V., Heath, J.D., Sharma, B.R., Weinstock, G.M., 1990. Comparison of genomic DNAs of different enterococcal isolates using restriction endonucleases with infrequent recognition sites. J. Clin. Microbiol. 28, 2059–2062.
Parveen, S., Murphree, R.L., Edmiston, L., Kaspar, C.W., Portier, K.M., Tamplin, M.L., 1997. Association of multiple-antibiotic-resistance profiles with point and nonpoint sources of Escherichia coli in Apalachicola Bay. Appl. Environ. Microbiol. 63, 2607–2612.
Price, B., Venso, E.A., Frana, M.F., Greenberg, J., Ware, A., Currey, L., 2006. Classification tree method for microbial source tracking with antibiotic resistance analysis data. Appl. Environ. Microbiol. 72, 3468–3475.
Ritter, K.J., Carruthers, E., Carson, C.A., Ellender, R.D., Harwood, V.J., Kingsley, K., Nakatsu, C., Sadowsky, M., Shear, B., West, B., Whitlock, J.E., Wiggins, B.A., Wilbur, J.D., 2003. Assessment of statistical methods used in library-based approaches to microbial source tracking. J. Water Health 1 (4), 209–223.
Ringner, G., Peterson, C., Khan, J., 2002. Analyzing array data using supervised methods. Pharmacogenomics 3 (3), 403–415.
Ross, S., 2005. A First Course in Probability, seventh ed. Prentice-Hall, Englewood Cliffs, NJ.
Scott, T.M., Jenkins, T., Lukasik, J., Rose, J., 2005. Potential use of a host associated molecular marker in Enterococcus faecium as an index of human fecal pollution. Environ. Sci. Technol. 39, 283–287.
Simpson, J.M., Santo Domingo, J.W., Reasoner, D.J., 2002. Microbial source tracking: state of the science. Environ. Sci. Technol. 36, 5279–5288.
Stoeckel, D.M., Harwood, V.J., 2007. Performance, design, and analysis in microbial source tracking studies. Appl. Environ. Microbiol. 73, 2405–2415.
US Environmental Protection Agency, 1986. Ambient Water Quality Criteria for Bacteria—1986. EPA-440/5-84-002.
US Environmental Protection Agency, 2005. Microbial source tracking guide document, EPA/600-R-05-064, June 2005. US Environmental Protection Agency, Washington, DC.
Wheeler, A.L., Hartel, P., Godfrey, D., Hill, J., Segars, W., 2002. Potential of Enterococcus faecalis as a human fecal indicator for microbial source tracking. J. Environ. Qual. 31, 1286–1293.
Whitlock, J.E., Jones, D.T., Harwood, V.J., 2002. Identification of the sources of fecal coliforms in an urban watershed using antibiotic resistance analysis. Water Res. 36, 4273–4282.
Wiggins, B.A., Andrews, R.W., Conway, R.A., Corr, C.L., Dobratz, E.J., Dougherty, D.P., Eppard, J.R., Knupp, S.R., Limjoco, M.C., Mettenburg, J.M., Rinehardt, J.M., Sonsino, J., Torrijos, R.L., Zimmerman, M.E., 1999. Use of antibiotic resistance analysis to identify nonpoint sources of fecal pollution. Appl. Environ. Microbiol. 65, 3483–3486.