ARTICLE IN PRESS

Available at www.sciencedirect.com

WATER RESEARCH 41 (2007) 3575–3584

journal homepage: www.elsevier.com/locate/watres

A comparison of ARA and DNA data for microbial source tracking based on source-classification models developed using classification trees

Bertram Price a,*, Elichia Venso b,c, Mark Frana b, Joshua Greenberg a, Adam Ware a

a Price Associates, Inc., One N. Broadway Ste 406, White Plains, NY 10601, USA

b Department of Biological Sciences, Salisbury University, Salisbury, MD 21801, USA

c Environmental Health Sciences, Salisbury University, Salisbury, MD 21801, USA

* Corresponding author. Tel.: +1 914 686 7975; fax: +1 914 686 7977. E-mail address: [email protected] (B. Price).

0043-1354/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.watres.2007.05.026

a r t i c l e i n f o

Article history:

Received 16 January 2007

Received in revised form

8 May 2007

Accepted 10 May 2007

Available online 24 May 2007

Keywords:

MST

Classification trees

DNA

ARA

Library size

Reliability

Enterococcus spp.

a b s t r a c t

The literature on microbial source tracking (MST) suggests that DNA analysis of fecal

samples leads to more reliable determinations of bacterial sources of surface water

contamination than antibiotic resistance analysis (ARA). Our goal is to determine whether

the increased reliability, if any, in library-based MST developed with DNA data is sufficient

to justify its higher cost, where the bacteria source predictions are used in TMDL surface

water management programs. We describe an application of classification trees for MST

applied to ARA and DNA data from samples collected in the Potomac River Watershed in

Maryland. Conclusions concerning the comparison of ARA and DNA data, although

preliminary at the current time, suggest that the added cost of obtaining DNA data in

comparison to the cost of ARA data may not be justified, where MST is applied in TMDL

surface water management programs.

© 2007 Elsevier Ltd. All rights reserved.

1. Introduction

Microbial contamination of surface waters resulting from

fecal materials in a watershed is a major environmental

health concern. Microbial source tracking (MST), the initial

investigation for reducing surface water contamination,

includes identifying the sources of fecal material in a

watershed and determining how much each source contributes

to the contamination. Based on MST information, fecal

bacteria sources may be targeted for reduction, which in turn

would be expected to reduce microbial surface water loading

to acceptable levels, such as those consistent with EPA’s total

maximum daily load (TMDL) guidelines (US EPA, 1986).1

1 http://www.epa.gov/owow/tmdl/intro.html; Maryland Department of the Environment (2005).

Briefly, fecal (scat) samples are collected from known

potential sources of surface water bacterial contamination in

a specific watershed of interest. The bacteria from these

sources are isolated, and may be analyzed by a number of

phenotypic and genotypic methods (Stoeckel and Harwood,

2007). The results of these analyses constitute a reference

library of source-specific bacterial isolates, which is used to

develop a source-classification model. The model is then

applied to water samples from the targeted surface water body

in order to estimate relative contributions of bacterial contamination

into the surface water by each source. For application in

a TMDL program, the relative sources are then quantified in

terms of loads. Using the source loads, bacteria-loading

reduction goals are set for each source in order to reduce the

concentration of bacterial contaminant in the targeted surface

water to within regulatory levels.

The two predominant types of data that have been used for

MST are based on antibiotic resistance analysis (ARA) and

DNA fingerprinting (Graves et al., 2002; Hagedorn et al., 1999;

Harwood et al., 2000, 2003; Hassan et al., 2005; Parveen et al.,

1997; Simpson et al., 2002; US EPA, 2005; Whitlock et al., 2002;

Wiggins et al., 1999). Our interest is in comparing the

reliability of these two types of data for library-based MST.

While reliability for an application of MST is based on the

probability of correct classification of surface water samples

containing bacteria among a set of potential sources, a

meaningful evaluation of reliability requires the context of a

specific application. The context we employ is the application

of MST in TMDL surface water management programs. Our

objective is to determine whether increased reliability, if any,

of library-based MST developed with DNA data is sufficient to

justify its higher cost, where source predictions are used in

TMDL surface water management programs.

The probability of correct source classification depends on

various factors including (1) the index bacteria organism

employed; (2) the number of sources; (3) the size of the library

used to develop the MST model (i.e., the number of samples

collected and the corresponding number of isolates prepared

and analyzed); (4) the degree to which library samples

represent the sources in the watershed; (5) the measurement

method used to characterize the isolates; and (6) the

statistical model used for bacterial source classification

(Johnson et al., 2004; Lasalde et al., 2005). To focus attention

on the comparison of ARA data versus pulsed-field gel

electrophoresis (PFGE) DNA data, we held a few of these

factors constant: the index organism was Enterococcus spp.;

there were four bacterial sources (human, pet, livestock, and

wildlife); and the source classification model was developed

using classification trees. This analysis of ARA versus DNA

data for MST provides a partial answer for comparing the

rates of correct classification for each type of data. The results

suggest that while source classification using DNA data has a

higher correct classification rate than ARA data, this advantage

does not necessarily translate into different strategies for

achieving TMDL objectives, and may therefore not be

sufficient to justify the higher cost of obtaining DNA data.

2. Materials and methods

2.1. Sample collection and library representation of the watershed

Scat samples should be collected in a manner that provides

coverage for the entire watershed. Stratification and proportional

sampling by strata could be used if information were

available indicating that certain areas within the watershed

were more likely as sources of fecal bacteria than other areas.

Also, if information were available on the relative population

sizes of the contributing sources (i.e., pet, human, livestock,

and wildlife for the watershed), the source categories could be

used to define strata and a proportional sampling scheme for

these strata could be designed. However, in any initial MST

investigation, it is unlikely that the information needed to

define strata would be well developed. Therefore, a sampling

design based on a combination of expert scientific judgment

and convenience is a practical choice. In our example, scat

collection focused on the most likely sources of fecal

contamination for the Potomac River and included 283 scat

samples from humans (sewage), and a variety of different

livestock, pets, and wildlife.

2.2. Isolation and identification of bacteria

Enterococcus spp. was the indicator organism used for this

MST study (Wheeler et al., 2002). Up to eight enterococcal

isolates were obtained from each of the 283 scat samples

collected in the Potomac River Watershed from the aforementioned

sources. Bacterial isolates that produced reddish

colonies when grown on m-Enterococcus Agar (Difco™)

and turned Enterococcosel™ broth (BBL™) black (esculin

positive) were presumptively identified as Enterococcus and were used

for ARA.

Prior to DNA fingerprinting, the same enterococci isolates

used for ARA were further identified to species using carbon

source metabolic fingerprint analysis by BiOLOG MicroLog™

System (Hayward, CA). Only those isolates specifically identified

as Enterococcus casseliflavus, Enterococcus faecalis, and

Enterococcus faecium were analyzed for DNA fingerprints using

PFGE, as these were the most abundant species. To ensure fair

comparison of ARA and DNA data, only isolates identified to

those three bacterial species were used for both types of

analysis (n = 234).

2.3. Antibiotic resistance analysis

Enterococcus spp. isolates were inoculated into Enterococcosel

broth contained in each of 48 wells of a 96-well plate. Plates

were then incubated overnight at 37 °C. By using a replicate

plater, isolates were transferred from the 48 wells onto 31

agar plates, each containing a different antibiotic-concentration

combination. Agar plates were incubated at 37 °C for 24 h

and then individual isolates were scored for growth or no

growth. Isolates that grew on a particular plate were

considered resistant to the antibiotic at that concentration,

and isolates that did not grow were considered susceptible.

2.4. DNA fingerprint analysis by PFGE

DNA was extracted from the bacterial isolates and embedded

into agarose plugs as previously described (Murray et al.,

1990). Bacterial DNA was digested using the restriction

enzyme Sma I. The DNA fragments generated from these

isolates were separated by PFGE using the CHEF Mapper™

(BioRad Laboratories, Hercules, CA). Gels were photographed

and banding patterns were quantified using BioNumerics®

(Applied Maths, Inc., Austin, TX).

2.5. Data characteristics

ARA and DNA laboratory analyses of isolates may be recorded

either as binary or as interval data. For ARA, isolate bacteria

are treated with different fixed concentrations of antibiotics.

When the bacteria grow in the presence of a particular

antibiotic at a particular concentration, this is recorded as a 1,

signifying resistance. When the bacteria do not grow in the

presence of a particular antibiotic-concentration treatment,

this is recorded as a 0, signifying susceptibility. These data

yield an antibiotic resistance profile consisting of 1’s and 0’s

for each antibiotic at each concentration. Alternatively, the

ARA laboratory results may be condensed by recording the

highest concentration of each antibiotic of those tested for

which the bacteria are resistant, thereby yielding an antibiotic

resistance profile consisting of interval data. For example, if

bacteria are treated with 10 antibiotics at 5 different

concentrations each, then the data may be recorded as 50

binary variables or 10 interval variables. If we impose the

biologically indicated assumption that bacteria will be

resistant to all concentrations of an antibiotic less than the

highest concentration for which the bacteria are resistant,

then there is a one-to-one relationship between ARA binary

and interval data as long as no data are missing. Our dataset

was complete, and we checked it to make sure that this

assumption held true. We therefore used the interval data,

due to its lower dimensionality.
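The correspondence between binary and interval ARA data can be illustrated with a short sketch. The antibiotic names, concentrations, and resistance flags below are made-up placeholders rather than values from the Potomac library, and the function names are hypothetical.

```python
from typing import Dict, List


def binary_to_interval(
    profile: Dict[str, List[int]],
    concentrations: Dict[str, List[float]],
) -> Dict[str, float]:
    """Collapse a binary resistance profile to one value per antibiotic:
    the highest tested concentration at which the isolate grew.
    A value of 0.0 means the isolate was susceptible at every tested level."""
    interval = {}
    for antibiotic, flags in profile.items():
        levels = concentrations[antibiotic]  # assumed sorted in ascending order
        resisted = [c for c, f in zip(levels, flags) if f == 1]
        interval[antibiotic] = max(resisted) if resisted else 0.0
    return interval


def is_monotone(flags: List[int]) -> bool:
    """True if resistance never reappears above a concentration at which
    the isolate was susceptible (all 1s come before all 0s)."""
    return all(a >= b for a, b in zip(flags, flags[1:]))


# Illustrative placeholder data, not taken from the study.
concentrations = {"antibiotic_A": [10.0, 20.0, 40.0], "antibiotic_B": [12.5, 25.0, 50.0]}
profile = {"antibiotic_A": [1, 1, 0], "antibiotic_B": [0, 0, 0]}

interval = binary_to_interval(profile, concentrations)
all_monotone = all(is_monotone(flags) for flags in profile.values())
```

When `is_monotone` holds for every antibiotic and no data are missing, the binary profile can be rebuilt from the interval value, which is the one-to-one relationship described above.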

For DNA fingerprint analysis, photographs of the electro-

phoresis gel runs capture arrays of bands corresponding to

different DNA fragment lengths. The intensities of the bands

reflect the density of these fragments. The results may be

recorded either as interval data, the grayscale intensity of the

bands at different fragment lengths, or as binary data, the

presence or absence of the bands at different fragment

lengths. Given that binary and interval data for DNA

fingerprint analysis have the same dimension, interval data,

due to their higher information content, are preferable for

statistical analysis.

2.6. Statistical method—classification trees

2.6.1. Algorithm

The classification tree method builds classification models by

recursively splitting the reference library of isolates in two

such that the resulting subsets, termed nodes, are increasingly

homogeneous with respect to isolate sources (Breiman

et al., 1998; Hastie et al., 2001; Price et al., 2006).2 The first split

divides the reference library of isolates into two nodes by

considering every binary split defined by the outcome

associated with the range of values for every variable. A

variable is selected for the split that maximizes homogeneity

of isolate sources within each of the two resultant nodes. The

same procedure is applied to each of the resultant nodes, and

a tree of recursively partitioned data is grown until a stopping

rule is invoked or until each node is homogeneous by source

category and cannot be split any further. The collection of

nodes at the ends of the branches of the tree, called terminal

nodes, constitutes a classification model. That is, each

terminal node is characterized by a unique profile of variables

and their range of values, and is associated with one source

category, the most prevalent source category represented by

isolates in the node. Consequently, any isolate, whether from

the library or a water sample of unknown source, that has the

same variable profile as that specified by a particular terminal

node would be assigned the source associated with that

terminal node.

2 We used Classification and Regression Trees (CART®) software developed by Salford Systems to build classification trees.

Various approaches for determining when to conclude the

node-splitting process have been suggested and analyzed

(Breiman et al., 1998). Stopping criteria, or rules for determin-

ing when to stop splitting nodes, are typically based on a

measure of source homogeneity for isolates within a node.

Source homogeneity achieves its maximum when all isolates

in a node have the same source. A node where an additional

split would not yield a specified minimum increase in

homogeneity would be a terminal node.

Hastie et al. (2001) describe three criteria or indexes that

address homogeneity. As formulated, the indexes measure

the complement of homogeneity, which Hastie et al. refer to

as impurity. Homogeneity is maximized when impurity is

minimized. The three impurity indexes are Misclassification

error, Gini index, and Cross-entropy deviance. The optimal

split of a node at any stage of model development is the one

that minimizes impurity as measured by the chosen index.

We relied on the Gini index for developing the classification

tree models in our examples.

The Gini index is calculated for a node labeled $m$ as

$$\sum_{k} p_{mk}\,(1 - p_{mk}),$$

where $p_{mk}$ is the proportion of isolates from source $k$ in node $m$.

The minimum value taken by the Gini index is 0, which

occurs when all isolates in a node are from the same source

(i.e., minimum impurity corresponds to maximum homogeneity).

The maximum of the Gini index occurs when

the isolates in a node are equally distributed among the

sources.
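As a minimal sketch of how the Gini index drives a split, the code below computes the index for a node and searches one interval variable for the threshold that minimizes the impurity of the two child nodes. Weighting each child by its share of isolates is the usual CART convention and is an assumption here; the sketch also uses raw within-node proportions and ignores prior probabilities. The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(sources: Sequence[str]) -> float:
    """Gini index sum_k p_k (1 - p_k) for the source labels in one node."""
    n = len(sources)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(sources).values())


def best_threshold(values: Sequence[float], sources: Sequence[str]) -> Tuple[float, float]:
    """Return (threshold, impurity) for the best split 'value <= threshold'.

    The impurity of a split is the size-weighted Gini index of its two children.
    If no split improves on the unsplit node, the threshold is NaN."""
    n = len(values)
    best = (float("nan"), gini(sources))
    # The largest value is skipped because it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [s for v, s in zip(values, sources) if v <= t]
        right = [s for v, s in zip(values, sources) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best


# Toy node: one interval ARA variable and the known source of each isolate.
values = [10.0, 10.0, 20.0, 40.0, 40.0, 40.0]
sources = ["human", "human", "human", "wildlife", "wildlife", "wildlife"]
split = best_threshold(values, sources)
```

A full tree applies the same search to every variable at every node and recurses on the two children until a stopping rule is met.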

At this time, we know of no general guidance concerning a

preference for one of the impurity indexes for MST modeling.

Following the general suggestion that model building is an

exploratory data analysis process, experimentation with each

of the indexes may be beneficial.

2.6.2. Source identification for terminal nodes

The classification model, equivalently the set of rules for

classifying isolates by source, is embodied in the collection of

terminal nodes. Each terminal node in a tree represents

isolates with a particular combination of outcomes for the

classification variables and assigns the source with highest

probability in the node to all such isolates.

The source probabilities in a terminal node are posterior

probabilities of the events: ‘‘source, given the pattern of

variables in the terminal node.’’ The posterior probability for a

particular source S in terminal node T is

$$P(S \mid T) = \frac{P(T \mid S)\, P(S)}{P(T)}, \tag{1}$$

where $P(S \mid T)$ is the posterior probability that an isolate

comes from source $S$ given that the isolate is in terminal

node $T$; $P(T \mid S)$ is the conditional probability that an isolate

is in terminal node $T$ given that it comes from source

$S$; $P(S)$ is the prior probability that an isolate comes from

source $S$; and $P(T)$ is the overall probability that an isolate is in

node $T$.

Where there are four source categories, the posterior

probability for the ith source would be calculated as

$$P(S_i \mid T) = \frac{P(T \mid S_i)\, P(S_i)}{\sum_{j} P(T \mid S_j)\, P(S_j)}. \tag{2}$$

$P(T \mid S_j)$ can be estimated from the isolates in the reference

library as $n_j/N_j$, where $n_j$ is the number of isolates from

source $j$ in terminal node $T$ and $N_j$ is the total number of

isolates from source $j$ in the reference library. $P(S)$, the prior

probability distribution of source identity, must be specified

as a modeling parameter that is not necessarily determined

by the library data. Therefore, the prior probabilities may play

a significant role in determining source predictions in the

classification tree model.

There are two immediately apparent choices for assigning

values to the prior probabilities: (1) $P(S_i)$ is proportional to

the population size of the $i$th source in the watershed, which

if estimated from the isolates in the reference library

would be $N_i / \sum_j N_j$, where $\{N_i,\ i = 1, 2, 3, 4\}$ are the numbers of

isolates from source $i$ in the library; (2) equal values, 0.25 for

each of the four sources, which is referred to as a non-

informative prior used when little or no information is

available about the distribution of sources in the watershed.

If the source proportions from the library were used as

prior probabilities in Eq. (2), the posterior source probabilities

would be the proportions of the sources for isolates in

that node. Then, the node would be represented by the

majority source among the isolates in the node. Subsequently,

any isolate of unknown source that would be

classified by the tree classification model into the terminal

node under discussion would be assigned the majority source

for that node.

However, using the source proportions among the isolates

in the reference library as prior probabilities is not advisable

in most circumstances. It is unlikely that the proportions of

isolates in the reference library from a particular source

would reflect the true proportions of potential sources of

bacteria in the watershed. Even if the collection of samples

provides adequate coverage of the sources in the watershed,

the numbers of isolates for each source are unlikely to be

proportional to actual source distribution.

The second obvious choice for prior probabilities, equal

probabilities of 0.25 for each of the four sources in the

Potomac analysis, is a more acceptable choice. Absent specific

reliable data on the source contributions of bacteria from the

watershed, the equal priors properly reflect ignorance concerning

the source contributions and, in that respect, lead to

an unbiased classification model. Under the equal prior

probability choice, the posterior probabilities would be

calculated using Eq. (2), with each $P(S_j)$ replaced by 0.25. The

posterior probability for the $i$th source would be

$$P(S_i \mid T) = \frac{n_i/N_i}{\sum_{j} n_j/N_j}. \tag{3}$$
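The effect of the prior on the source assigned to a node can be seen in a small sketch of Eqs. (2) and (3). The library counts are the isolate totals reported for the Potomac library (human 54, livestock 32, pet 18, wildlife 130); the terminal-node counts are invented for illustration, and the function name is hypothetical.

```python
from typing import Dict, Optional


def node_posteriors(
    node_counts: Dict[str, int],
    library_counts: Dict[str, int],
    priors: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Posterior source probabilities for one terminal node, as in Eq. (2).

    P(T | S_j) is estimated as n_j / N_j. With priors=None, equal priors
    are used, which reduces Eq. (2) to Eq. (3)."""
    sources = list(library_counts)
    if priors is None:
        priors = {s: 1.0 / len(sources) for s in sources}
    weighted = {
        s: node_counts.get(s, 0) / library_counts[s] * priors[s] for s in sources
    }
    total = sum(weighted.values())  # assumes the node contains at least one isolate
    return {s: w / total for s, w in weighted.items()}


library = {"human": 54, "livestock": 32, "pet": 18, "wildlife": 130}
node = {"human": 6, "livestock": 4, "pet": 3, "wildlife": 13}  # hypothetical node

equal_prior_post = node_posteriors(node, library)

# Priors proportional to library composition, N_i / sum_j N_j.
n_library = sum(library.values())
library_priors = {s: n / n_library for s, n in library.items()}
library_prior_post = node_posteriors(node, library, library_priors)
```

In this made-up node, wildlife supplies half of the isolates, so library-proportional priors assign the node to wildlife, whereas equal priors favor pet because 3 of only 18 pet isolates fall in the node.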

2.6.3. Model stability

A classification model is stable if it consistently classifies new

isolates from known sources accurately. Stability depends on

the size of the reference library, how well the isolates in the

library represent the watershed, the number of terminal

nodes in the classification model, and the number of isolates

in each terminal node. A model with a large number of

terminal nodes will classify the isolates in the library with

great accuracy, but if the number of isolates in each terminal

node is small, the model may not perform well on samples

that were not in the library. For classification tree models in

general, model stability involves a tradeoff among library size,

number of terminal nodes, and number of isolates per

terminal node. Stability may be advanced by imposing size

limitations on the terminal nodes. In our example, we limited

terminal node size indirectly by requiring that a node contain

at least 10 isolates in order to be considered for further

splitting. Limiting the number of isolates in a terminal node is

equivalent to imposing a statistical precision requirement on

the estimates of posterior probabilities for that node. This

requirement can be used to determine appropriate data

reference library size, an important issue addressed in

Section 2.6.4.

2.6.4. Reference library size

The size of the reference library is determined by specifying

the number of field samples to be collected, the number of

isolates to be grown from each sample, and the number of

isolates to be analyzed. Determining the library size may

follow the usual approach for setting requirements for the

number of observations in a statistical sample: (1) state a

well-defined objective for the statistical analysis; (2) select a

criterion that measures the performance of the statistical

analysis method; and (3) select a large probability for assuring

that the criterion in (2) is met. This general approach,

although conceptually straightforward for typical statistical

objectives such as estimating parameters or testing hypotheses,

can be complicated in application and has not been

applied in the MST literature.

However, for source classification using the classification

tree model, library size may be determined by focusing on the

posterior probabilities in terminal nodes. Each terminal node

is associated with one source, the source corresponding to the

largest posterior probability among sources of the isolates in

the node. This association is determined by a statistical

estimate of posterior probabilities for the node. Therefore, the

statistical objective for determining the size of the library is to

assure with a high probability that the source corresponding

to the largest estimated posterior probability in a terminal

node is, in fact, the source for the true largest posterior

probability among all sources. One additional requirement

must be imposed for determining library size. Correct

selection of the source with the largest posterior probability

is assured only if the largest true posterior probability is larger

than the second largest true posterior probability by a

specified amount. If a small difference were specified, a larger

library would be required than if a larger difference were

specified. In addition, the probability of correct selection

(P(CS)) should be relatively large, at least 0.90.

This formulation for determining library size for MST is an

adaptation of the ranking and selection problem in statistics

(Gibbons et al., 1977). The ranking and selection objective is to

determine the number of samples required to select the

‘‘best’’ population from among a number of candidate

populations. For MST modeling, the populations are the

sources and ‘‘best’’ means largest posterior probability of

source for a terminal node. The number of samples must be

large enough to satisfy the two requirements mentioned

above: (1) correct selection should be assured when the

largest value of the posterior probability exceeds the second

largest by a specified amount; and (2) the probability of

making the correct selection subject to this requirement is at

least a pre-specified probability (e.g., 0.90, 0.95, or 0.99).

In each terminal node, we use the posterior probability to

identify one source among the four sources as the represen-

tative of that node. Where the prior probabilities are constant,

which we adopt in our modeling approach, the estimates of

the four posterior probabilities in a particular terminal node

are given by Eq. (3). To address the number of samples issue,

we adopt one simplification. We proceed by assuming that

the numbers of isolates in the library from each source are

the same (i.e., $N_i = N$ for all $i$). With this simplification, the

estimates of the posterior probabilities become $n_i / \sum_j n_j$,

the ratios of the numbers of isolates for each source in the

terminal node divided by the total number of isolates in the

terminal node. Additionally, the data in the terminal node

have a multinomial distribution (Ross, 2005).

We apply sample size tables developed for selecting the

category of a multinomial distribution with the largest

probability (Gibbons et al., 1977) to determine the number of

isolates in each terminal node. As an example, the number of

isolates required in a terminal node to assure correct

selection, with probability equal to 0.90, of the source

category with the largest posterior probability when the

largest posterior probability exceeds the second largest

posterior probability by a factor of 2.0 is 43 (Table 1).
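The tabulated node size can be sanity-checked by simulation. The sketch below is not the procedure used by Gibbons et al. (1977); it is a Monte Carlo estimate under assumptions that are ours: four sources, a least-favorable configuration in which one source is exactly `ratio` times as likely as each of the others, and ties for the largest count broken at random. The function name and parameters are hypothetical.

```python
import random


def estimate_pcs(n: int, ratio: float, k: int = 4, trials: int = 10000, seed: int = 0) -> float:
    """Monte Carlo estimate of the probability of correct selection, P(CS).

    n isolates are drawn from a multinomial with k categories in which
    category 0 is `ratio` times as likely as each other category.
    A trial counts as correct when category 0 has the largest count;
    a tie among m categories contributes 1/m (random tie-breaking)."""
    rng = random.Random(seed)
    p_other = 1.0 / (ratio + k - 1)
    probs = [ratio * p_other] + [p_other] * (k - 1)
    correct = 0.0
    for _ in range(trials):
        counts = [0] * k
        for source in rng.choices(range(k), weights=probs, k=n):
            counts[source] += 1
        top = max(counts)
        if counts[0] == top:
            correct += 1.0 / counts.count(top)
    return correct / trials


pcs_43 = estimate_pcs(43, 2.0)
pcs_104 = estimate_pcs(104, 2.0)
```

Under these assumptions the estimate for 43 isolates should land near the 0.90 target, and the estimate for 104 isolates should be higher, in line with the 0.99 column of Table 1.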

To determine the total number of samples and isolates for

the library, we need to specify the number of isolates per

sample and the number of terminal nodes we allow in the

classification model. For example, consider a design with 10

isolates per sample and a classification model with 20

terminal nodes. With 20 terminal nodes, 43 isolates per

terminal node translate into 860 total isolates in the library, or

86 samples with 10 isolates each. The number of samples for

each source would be approximately 22. If P(CS) were required

to be 0.99, then the number of isolates per terminal node

would be 104 (Table 1). For 20 terminal nodes and 10 isolates

per sample, the library must consist of approximately 208

samples and 2080 isolates. Our data from the Potomac River

Watershed are still being developed, and currently contain

fewer isolates than recommended by this method of determining

appropriate data library size.

Table 1 – Total number of isolates per terminal node to assure a specified probability of selecting the source with the largest probability

Ratio of largest probability      Probability of correct selection, P(CS)
to second largest probability     0.90      0.95      0.99
1.2                               692       979       1660
1.4                               196       278       471
1.6                               98        139       235
1.8                               61        87        147
2.0                               43        61        104
2.2                               33        46        79
2.4                               26        37        63
2.6                               22        31        53
2.8                               19        26        45
3.0                               16        23        39

Source: Gibbons et al. (1977).
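The library-size arithmetic in this example can be written out as a short sketch. The lookup holds only the Table 1 row for a ratio of 2.0; the function name is hypothetical, and the calculation assumes, as the text does, that isolates are spread evenly over terminal nodes and that samples are split evenly among the four sources.

```python
import math
from typing import Dict, Tuple

# Isolates per terminal node from Table 1 (ratio of largest to second largest = 2.0).
ISOLATES_PER_NODE: Dict[Tuple[float, float], int] = {
    (2.0, 0.90): 43,
    (2.0, 0.95): 61,
    (2.0, 0.99): 104,
}


def library_size(
    ratio: float,
    pcs: float,
    terminal_nodes: int,
    isolates_per_sample: int,
    n_sources: int = 4,
) -> Tuple[int, int, int]:
    """Return (total isolates, total samples, samples per source)."""
    per_node = ISOLATES_PER_NODE[(ratio, pcs)]
    total_isolates = per_node * terminal_nodes
    samples = math.ceil(total_isolates / isolates_per_sample)
    samples_per_source = math.ceil(samples / n_sources)
    return total_isolates, samples, samples_per_source


design_090 = library_size(2.0, 0.90, terminal_nodes=20, isolates_per_sample=10)
design_099 = library_size(2.0, 0.99, terminal_nodes=20, isolates_per_sample=10)
```

Here `design_090` reproduces the 860 isolates, 86 samples, and roughly 22 samples per source quoted above, and `design_099` reproduces the 2080 isolates and 208 samples.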

2.6.5. Classification probability thresholds

Where the number of isolates in the library is determined by

the method described above, the source associated with each

terminal node will be, with a high probability, correctly

identified as the source with the largest posterior probability

among the sources with isolates in that node. In the above

example, we chose a design that assures correct selection of

the source provided the largest posterior probability is at least

twice the second largest posterior probability. However, even

though the terminal node has been correctly assigned a

source, the probability of source membership for isolates in

that node may not be large enough for source classification

with a high level of confidence. Consider two examples. In the

first example, the estimate of the largest posterior probability

is 0.80. In the second example, the estimate of the largest

posterior probability is 0.50. Whereas we may have comfort in

assigning an isolate to a source where the probability of

belonging to that source is 0.80, we would be uncertain about

the correctness of the assignment if the probability were 0.50.

This uncertainty can be reduced by establishing a threshold

for the maximum probability in a terminal node before

accepting the node as a basis for assigning isolates to the

source associated with the node. If the maximum posterior

probability in a terminal node does not exceed the threshold,

isolates in that terminal node would be classified as ‘‘source

unknown.’’

For example, consider a node that contains 6 pet, 6 human,

12 livestock, and 36 wildlife isolates. With equal priors and equal

numbers of library isolates per source, the posterior probabilities

would be 10% pet, 10% human, 20% livestock, and 60%

wildlife. Without any threshold, all isolates falling into this

node would be classified as wildlife. If the threshold were 55%,

the node and all isolates in it would also be classified as

wildlife. With a threshold of 70%, isolates falling into this

node would be classified as ‘‘source unknown.’’
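The threshold rule in this example can be sketched in a few lines of Python. This is only an illustration: the function name `classify_node` is hypothetical, and the node proportions are used directly as the posterior probabilities, as in the worked example above.

```python
def classify_node(counts, threshold=None):
    """Assign a terminal node to a source, or to 'source unknown'.

    counts: dict mapping source name -> number of library isolates in the node.
    threshold: probability the leading source must exceed (None = no threshold).
    Assumption: node proportions stand in for the posterior probabilities,
    as in the worked example in the text.
    """
    total = sum(counts.values())
    probs = {source: n / total for source, n in counts.items()}
    best = max(probs, key=probs.get)
    if threshold is not None and probs[best] <= threshold:
        return "source unknown", probs
    return best, probs


# Node from the example: 6 pet, 6 human, 12 livestock, 36 wildlife isolates.
node = {"pet": 6, "human": 6, "livestock": 12, "wildlife": 36}
for t in (None, 0.55, 0.70):
    label, _ = classify_node(node, t)
    print(t, label)
```

With no threshold or a 55% threshold the node is labeled wildlife; with a 70% threshold it is labeled "source unknown", matching the example.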

A threshold probability reflects the level of confidence we require of the classification scheme. Increasing the threshold generally increases the correct classification rate among the isolates that are classified, but it also increases the number of isolates classified as unknown. The tradeoff between correct classification rate and number of unknowns may be investigated for each library and source classification model by varying the probability threshold value.
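The tradeoff can be explored with a simple sweep over threshold values. The sketch below is illustrative only: the three terminal nodes are hypothetical (they are not taken from the Potomac library), node proportions stand in for posterior probabilities, and the correct classification rate is computed over classified isolates only, which is an assumption about how unknowns are treated.

```python
def sweep_threshold(nodes, thresholds):
    """For each threshold, return (threshold, correct rate among classified, n unknown).

    nodes: list of dicts mapping source -> isolate count in a terminal node.
    Each node is assigned to its most frequent source; isolates of that source
    count as correctly classified (resubstitution).
    """
    results = []
    for t in thresholds:
        classified = correct = unknown = 0
        for counts in nodes:
            total = sum(counts.values())
            top = max(counts.values())
            if top / total > t:
                classified += total
                correct += top
            else:
                unknown += total
        rate = correct / classified if classified else float("nan")
        results.append((t, rate, unknown))
    return results


# Hypothetical terminal nodes (not from the Potomac library).
nodes = [
    {"human": 40, "livestock": 5, "pet": 0, "wildlife": 5},    # leading share 0.80
    {"human": 6, "livestock": 12, "pet": 6, "wildlife": 36},   # leading share 0.60
    {"human": 10, "livestock": 26, "pet": 2, "wildlife": 12},  # leading share 0.52
]
for t, rate, unknown in sweep_threshold(nodes, [0.0, 0.55, 0.70]):
    print(f"threshold={t:.2f}  correct rate={rate:.3f}  unknown={unknown}")
```

In this toy case the correct rate rises from about 0.64 to 0.80 as the threshold goes from 0 to 0.70, while the number of unknown isolates rises from 0 to 110 of 160.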

3. Results

Results for the classification tree method based on resub-

stitution are presented in the figures and tables. We provide

these preliminary results as part of an ongoing comparison of

ARA and DNA data for MST.


The library consisted of 234 Enterococcus isolates with

complete data derived from 55 samples, with the source

distribution shown in Table 2.

We developed source-classification models for the ARA and

DNA libraries using the classification trees methodology. For

illustration purposes, we provide abbreviated versions of the

classification tree models for the two types of data (Figs. 1

and 2). When viewing these figures, keep in mind that we used equal prior source probabilities. Therefore, the raw source distribution of isolates in a node does not equal the posterior probability distribution of sources in that node.
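To see why the raw node counts and the posterior probabilities differ under equal priors, consider the following sketch. It assumes the usual prior-adjusted node estimate for CART-style trees, in which the posterior for a source is proportional to its prior times the fraction of that source's library isolates falling in the node; the function name `node_posteriors` is hypothetical. The counts are those of the Fig. 1 node with CEPHALOTHIN <= 12.500.

```python
# Library totals by source (Table 2).
LIBRARY = {"human": 54, "livestock": 32, "pet": 18, "wildlife": 130}


def node_posteriors(node_counts, library=LIBRARY, priors=None):
    """Posterior source probabilities for a node under given priors.

    Uses p(source | node) proportional to prior * (node count / library count).
    priors=None means equal priors (assumed to match the setting in the text).
    """
    if priors is None:
        priors = {s: 1 / len(library) for s in library}
    weights = {s: priors[s] * node_counts[s] / library[s] for s in library}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Counts for the Fig. 1 node with CEPHALOTHIN <= 12.500 (N = 59).
node = {"human": 2, "livestock": 24, "pet": 0, "wildlife": 33}
post = node_posteriors(node)
print({s: round(p, 3) for s, p in post.items()})
print(max(post, key=post.get))
```

Although wildlife isolates make up 55.9% of this node, the node holds 24 of the 32 livestock isolates in the library, so under equal priors livestock receives a posterior of roughly 0.72 and the node is assigned to livestock.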

Table 2 – Distribution of samples across sources, Potomac River Watershed

Source       Samples   Isolates
Human        14        54
Livestock    10        32
Pet          5         18
Wildlife     26        130
Total        55        234

In Fig. 1, there are four terminal nodes stemming from three splits of the ARA data library. In our model, these nodes were split further, eventually yielding 28 terminal nodes. We employed a stopping rule requiring a minimum of 10 isolates in a node before splitting.

Fig. 1 – First three splits in classification tree for Potomac River Watershed ARA data. Antibiotic concentrations are in µg/ml. Node contents are given as cases (%).

Node (split condition)            N     Human       Livestock   Pet         Wildlife     Class
Root                              234   54 (23.1)   32 (13.7)   18 (7.7)    130 (55.6)
CHLORTETRACYCLINE <= 70.000       194   42 (21.6)   32 (16.5)   1 (0.5)     119 (61.3)
CHLORTETRACYCLINE > 70.000        40    12 (30.0)   0 (0.0)     17 (42.5)   11 (27.5)    pet
  CEPHALOTHIN <= 12.500           59    2 (3.4)     24 (40.7)   0 (0.0)     33 (55.9)    livestock
  CEPHALOTHIN > 12.500            135   40 (29.6)   8 (5.9)     1 (0.7)     86 (63.7)
    STREPTOMYCIN <= 20.000        22    18 (81.8)   2 (9.1)     0 (0.0)     2 (9.1)      human
    STREPTOMYCIN > 20.000         113   22 (19.5)   6 (5.3)     1 (0.9)     84 (74.3)    wildlife

In Fig. 2, there are four terminal nodes stemming from three splits of the DNA data library. In our model, these nodes were split further, eventually yielding 19 terminal nodes. We employed the same stopping rule for DNA data as for ARA data.

Tables 3 and 4 show the resubstitution results, comparing the bacterial source-classification schemes for ARA and DNA data to the actual sources. Modeling the DNA data yielded a higher overall rate of correct reclassification (RCC), though modeling the ARA data resulted in higher rates of correct reclassification for pet and livestock isolates.

Applying a 70% threshold probability to the terminal nodes in both models improved the rate of correct reclassification for both the ARA and DNA library data (Tables 5 and 6).

4. Discussion

We are currently engaged in a research project to compare the
efficiency of ARA data versus DNA data for MST within the
context of TMDL surface water bacterial contamination


Table 3 – Resubstitution results for Potomac River Watershed ARA reference library isolates

Actual       Predicted                                                      RCC^a (%)
             Human       Livestock   Pet         Wildlife    Total (%)
Human        37          4           3           10          54 (23.1)      68.5
Livestock    1           31          0           0           32 (13.7)      96.9
Pet          0           0           18          0           18 (7.7)       100.0
Wildlife     12          33          3           82          130 (55.6)     63.1
Total (%)    50 (21.4)   68 (29.1)   24 (10.3)   92 (39.3)   234

a Rate of correct classification. Overall percent correct = 71.8%.

Fig. 2 – First three splits in classification tree for Potomac River Watershed DNA data. Splitting variables are DNA fragment length measured in kilobases. Split labels shown in the figure: KB44_4 (<= 10.000, > 10.000), KB92_5 (<= 55.100, > 55.100), KB63_5 (<= 83.300, > 83.300). Node contents are given as cases (%).

Node N   Parent N   Human       Livestock   Pet         Wildlife     Class
234      (root)     54 (23.1)   32 (13.7)   18 (7.7)    130 (55.6)
139      234        21 (15.1)   24 (17.3)   18 (12.9)   76 (54.7)
95       234        33 (34.7)   8 (8.4)     0 (0.0)     54 (56.8)    human
81       139        7 (8.6)     8 (9.9)     17 (21.0)   49 (60.5)
58       139        14 (24.1)   16 (27.6)   1 (1.7)     27 (46.6)    livestock
42       81         3 (7.1)     3 (7.1)     17 (40.5)   19 (45.2)    pet
39       81         4 (10.3)    5 (12.8)    0 (0.0)     30 (76.9)    wildlife

3 While our results are for resubstitution, cross-validation provides a better measure of model performance. When we repeat the analysis with a larger library we will measure performance using cross-validation. For the current model, it is likely that RCCs for cross-validation would be reduced for both types of data, but DNA RCCs would remain larger. Therefore, the resubstitution results are adequate for this preliminary analysis.


management. The results we report in this article are a

preview of the results we are developing on this topic.

Currently, the library contains fewer isolates than recom-

mended by the library size determination method for

classification trees described earlier in this article. Addition-

ally, we need to conduct similar comparison analyses on

libraries for other watersheds. Subject to these caveats, the

results presented here indicate that while the overall rates of

correct classification are higher for the DNA data than for the

ARA data (Tables 3–6), the resulting source predictions for

both data indicate similar TMDL surface water bacterial

contamination reduction strategies.3
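The rates of correct classification reported in the tables follow from the confusion-matrix counts. As a minimal sketch, the code below recomputes the per-source RCC values and the overall percent correct from the Table 3 counts (ARA, no threshold); the function names are illustrative, not part of any software used in the study.

```python
SOURCES = ["human", "livestock", "pet", "wildlife"]

# Rows: actual source; columns: predicted source (counts from Table 3, ARA).
ARA = [
    [37, 4, 3, 10],
    [1, 31, 0, 0],
    [0, 0, 18, 0],
    [12, 33, 3, 82],
]


def rcc_by_source(matrix):
    """Percent of each actual source's isolates assigned to that source."""
    return [100 * row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_percent_correct(matrix):
    """Percent of all isolates on the diagonal of the confusion matrix."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    return 100 * correct / sum(map(sum, matrix))


for name, rcc in zip(SOURCES, rcc_by_source(ARA)):
    print(f"{name:10s} {rcc:5.1f}")
print(f"overall    {overall_percent_correct(ARA):5.1f}")
```

The printed values agree with the RCC column and the overall percent correct (71.8%) given in Table 3.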


Table 4 – Resubstitution results for Potomac River Watershed DNA reference library isolates

Actual       Predicted                                                       RCC^a (%)
             Human       Livestock   Pet         Wildlife     Total (%)
Human        39          7           3           5            54 (23.1)      72.2
Livestock    1           29          1           1            32 (13.7)      90.6
Pet          0           1           17          0            18 (7.7)       94.4
Wildlife     8           12          7           103          130 (55.6)     83.1
Total (%)    48 (20.5)   49 (20.9)   28 (12.0)   109 (46.6)   234

a Rate of correct classification. Overall percent correct = 80.3%.

Table 5 – Resubstitution results for Potomac River Watershed ARA reference library isolates, with 70% threshold probability applied to terminal nodes of the classification tree model

Actual       Predicted, with 70% threshold probability                                  RCC^a (%)
             Human       Livestock   Pet         Wildlife    Unknown     Total (%)
Human        34          2           3           2           13          54 (23.1)      81.1
Livestock    1           26          0           0           5           32 (13.7)      50.0
Pet          0           0           18          0           0           18 (7.7)       100.0
Wildlife     7           23          3           53          44          130 (55.6)     87.9
Total (%)    42 (17.9)   51 (21.8)   24 (10.3)   55 (23.5)   62 (26.5)   234

a Rate of correct classification. Overall percent correct = 76.2%.

Table 6 – Resubstitution results for Potomac River Watershed DNA reference library isolates, with 70% threshold probability applied to terminal nodes of classification tree model

Actual       Predicted, with 70% threshold probability                                  RCC^a (%)
             Human       Livestock   Pet         Wildlife    Unknown     Total (%)
Human        34          7           3           1           9           54 (23.1)      87.2
Livestock    0           27          1           0           4           32 (13.7)      92.9
Pet          0           1           17          0           0           18 (7.7)       100.0
Wildlife     6           7           7           77          33          130 (55.6)     83.7
Total (%)    40 (17.1)   42 (17.9)   28 (12.0)   78 (33.3)   46 (19.7)   234

a Rate of correct classification. Overall percent correct = 82.4%.


4.1. Use of bacterial species in predicting source of contamination

The rates of correct classification based on DNA data might
be larger still if Enterococcus species were taken into account.
The library used in the current analysis

included three Enterococcus species (E. casseliflavus, E. faecalis,

and E. faecium), but we did not use species information to

develop the source-classification model. It has been sug-

gested that certain Enterococcus species are specific to certain

sources (Wheeler et al., 2002; Scott et al., 2005; Kuntz et al.,

2004; Chou et al., 2004). These differences could be exploited

for source classification, and we plan to investigate this

source–species relationship further.

4.2. Choice of model

The MST literature describes a variety of statistical analysis

methods used for developing source classification models.

Comparisons of the methods have not indicated that any one

method performs generally better than the others for all

libraries (Ritter et al., 2003; Albert et al., 2003; Harwood et al.,

2000). We have chosen to use the classification tree method

(Breiman et al., 1998) for two reasons. Firstly, the classification

tree method is an algorithmic method as opposed to a

method that requires an underlying stochastic model

(Breiman, 2001). Logistic regression and discriminant analysis

are examples of methods that require underlying stochastic

models. The most significant advantage of an algorithmic


method over a model-based method is the opportunity to

search for and evaluate every complex interaction among the

predictor variables in building the classification model. The

number of interactions among predictor variables that can be

considered in model-based methods is severely limited by the

number of data observations (isolates in the library for MST).

There is no such restriction for algorithmic methods.
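One step of such an algorithmic search can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it assumes the Gini impurity as the splitting criterion (the criterion is not stated in this section), it searches a single predictor, it ignores prior probabilities, and the data are hypothetical. The `min_isolates` argument mirrors the stopping rule of at least 10 isolates in a node before splitting.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of source labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels, min_isolates=10):
    """Best 'value <= t' split of one predictor, as (t, weighted Gini).

    Returns None when the node holds fewer than min_isolates isolates
    (stopping rule) or when no split is possible.
    """
    n = len(values)
    if n < min_isolates:
        return None
    best = None
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical node: one antibiotic concentration per isolate, with sources.
values = [10, 10, 10, 20, 20, 20, 40, 40, 40, 80, 80, 80]
labels = ["human"] * 6 + ["wildlife"] * 6
print(best_split(values, labels))
```

A full tree repeats this search over every predictor at every node, which is how interactions among predictors enter the model without being specified in advance.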

Secondly, the classification tree method is a ‘‘supervised
learning’’ method (see, for example, Ringner et al., 2002). That

is, the source identifications of isolates in the library are

explicitly used in building the source-classification model.

Other algorithmic methods, generally referred to as clustering

methods, are ‘‘unsupervised learning’’ methods. These meth-

ods search for clusters or groupings of similar data observa-

tions (isolates) without taking account of source

identification. The clusters that evolve may or may not align

well with bacteria sources.

4.3. Cost and labor

Developing DNA PFGE data for scat sample isolates is

significantly more expensive and time consuming than

developing antibiotic resistance profiles. Our laboratory fees

for both ARA and PFGE methodologies accurately reflect their

relative costs. The PFGE fee per known-source isolate is three

times that for ARA. PFGE development additionally entails Enterococcus spp. determination, further increasing both cost and labor relative to ARA, which does not require identification of Enterococcus spp. Water sample analysis similarly costs 3–4 times as much for PFGE as for ARA for the same number of isolates. In addition to the cost differential, the DNA methodology takes about 10 times as long to implement as ARA, thereby restricting the size of the watershed that could be analyzed given a constant TMDL time frame.

4.4. TMDL surface water bacterial contamination management

As mentioned previously, it is not enough to investigate the

reliability of classifying library isolates characterized by DNA

and ARA, regardless of method, without considering a context

for application. We used the application of MST in TMDL

surface water management programs as the context. The classification model would be applied to water samples from a targeted surface water body to estimate relative source contributions of bacteria into the surface water. For application in a TMDL program, the relative source contributions of bacteria are quantified in terms of loads. Using the source loads, bacteria loading reduction goals would be set for each source in order to reduce the concentration of bacteria in the targeted surface water to within regulatory levels.

Table 7 – Predicted source distribution of Potomac River Watershed library isolates using classification tree modeling

             Without threshold probability    With 70% threshold probability
Source       ARA (%)      DNA (%)             ARA (%)      DNA (%)
Human        21.4         20.5                24.4         21.3
Livestock    29.1         20.9                29.7         22.3
Pet          10.3         12.0                14.0         14.9
Wildlife     39.3         46.6                32.0         41.5

We are currently developing DNA and antibiotic resistance

profiles for water samples collected from the Potomac River

Watershed. For illustrative purposes, Table 7 displays the

source percentages estimated by classifying the isolates in

the library as if they had been prepared from water samples

(i.e., sources unknown). These are the types of percentages

that would be used to establish bacterial contamination

reduction goals. The results obtained with a 70% threshold

classification probability indicate very little difference be-

tween ARA and DNA for the human source and the pet

source, but a larger difference for livestock, and an even larger

difference for wildlife. However, the ranking of relative

importance of the sources based on their percentages is the

same for ARA and DNA. Although the DNA results in this example are more accurate than the ARA results (Tables 5 and 6), the additional accuracy does not imply source management strategies different from those indicated by the ARA data.
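The thresholded percentages in Table 7 and the resulting source rankings can be recomputed from the predicted-source column totals in Tables 5 and 6. The sketch below assumes that isolates classified as unknown are excluded from the denominator, which is consistent with the published values; the function names are illustrative.

```python
SOURCES = ["human", "livestock", "pet", "wildlife"]

# Predicted-source column totals with the 70% threshold (Tables 5 and 6);
# isolates classified as "unknown" are left out of the denominator.
PREDICTED = {
    "ARA": [42, 51, 24, 55],
    "DNA": [40, 42, 28, 78],
}


def source_percentages(counts):
    """Percent of classified isolates attributed to each source."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


def ranking(counts):
    """Sources ordered from largest to smallest predicted share."""
    return [s for _, s in sorted(zip(counts, SOURCES), reverse=True)]


for method, counts in PREDICTED.items():
    print(method, source_percentages(counts), ranking(counts))
```

Both data types give the same ordering (wildlife, livestock, human, pet), which is the basis for the statement that they point to the same management priorities.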

5. Conclusions

Based on an analysis of a small library using the classification

tree method for developing a source-classification model, the

preliminary results show that source classification using DNA

data performs better than source classification using ARA

data. However, in a TMDL application, the practical differ-

ences in the predicted source distributions are small, and

thus indicate the same strategy for reducing bacterial

contamination in a TMDL program. In our laboratory, the

current cost per isolate of DNA PFGE analysis is four or five

times that of ARA. Identifying bacteria species in the DNA

PFGE analysis involves a process that further increases the

cost differential. Although our data and results are prelimin-

ary at this time, we conclude that questioning the value of

DNA data relative to ARA data for MST intended for

application in a TMDL program is justified, and the answer

may favor ARA data for this application.

REFERENCES

Albert, J.M., Munakata-Marr, J., Tenorio, L., Siegrist, R.L., 2003. Statistical evaluation of microbial source tracking data obtained by rep-PCR DNA fingerprinting of Escherichia coli. Environ. Sci. Technol. 37, 4554–4560.

Breiman, L., 2001. Statistical modeling: two cultures. Stat. Sci. 16 (3), 199–231.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1998. Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, FL.

Chou, C., Lin, Y., Su, J., 2004. Microbial indicators for differentiation of human- and pig-sourced fecal pollution. J. Environ. Sci. Health A Toxicol. Hazard Subst. Environ. Eng. 39 (6), 1415–1421.


Gibbons, J.D., Olkin, I., Steel, M., 1977. Selecting and Ordering Populations, A New Statistical Methodology. Wiley, New York.

Graves, A.K., Hagedorn, C., Teetor, A., Mahal, M., Booth, A.M., Reneau, R.B., 2002. Antibiotic resistance profiles to determine sources of fecal contamination in a rural Virginia watershed. J. Environ. Qual. 31, 1300–1308.

Hagedorn, C., Robinson, S.L., Filtz, J.R., Grubbs, S.M., Angier, T.A., Reneau Jr., R.B., 1999. Determining sources of fecal pollution in a rural Virginia watershed with antibiotic resistance patterns in fecal streptococci. Appl. Environ. Microbiol. 65, 5522–5531.

Harwood, V.J., Whitlock, J., Withington, V., 2000. Classification of antibiotic resistance patterns of indicator bacteria by discriminant analysis: use in predicting the source of fecal contamination in subtropical waters. Appl. Environ. Microbiol. 66, 3698–3704.

Harwood, V.J., Wiggins, B., Hagedorn, C., Ellender, R.D., Gooch, J., Kern, J., Samadpour, M., Chapman, A.C., Robinson, B.J., Thompson, B.C., 2003. Phenotypic library-based microbial source tracking methods: efficacy in the California collaborative study. J. Water Health 1 (4), 153–166.

Hassan, W.M., Wang, S.Y., Ellender, R.D., 2005. Methods to increase fidelity of repetitive extragenic palindromic PCR fingerprint-based microbial source tracking efforts. Appl. Environ. Microbiol. 71, 512–518.

Hastie, T., Tibshirani, R., Friedman, J.H., 2001. The Elements of Statistical Learning. Springer, New York.

Johnson, L.K., Brown, M.B., Carruthers, E.A., Ferguson, J.A., Dombek, P.E., Sadowsky, M.J., 2004. Sample size, library composition, and genotypic diversity among natural populations of Escherichia coli from different animals influence accuracy of determining sources of fecal pollution. Appl. Environ. Microbiol. 70, 4478–4485.

Kuntz, R., Hartel, P., Rodgers, K., Segars, W., 2004. Presence of Enterococcus faecalis in broiler litter and wild bird feces for microbial source tracking. Water Res. 38, 3551–3557.

Lasalde, C., Rodriguez, R., Toranzos, G.A., 2005. Statistical analyses: possible reasons for unreliability of source tracking efforts. Appl. Environ. Microbiol. 71, 4690–4695.

Maryland Department of the Environment, 2005. Draft total maximum daily loads of fecal bacteria for the non-tidal Anacostia River Basin in Montgomery and Prince George’s Counties, Maryland. [Online.] <http://www.mde.state.md.us/assets/document/Anacostia_%20fc_TMDL-08-03-2005_PN(1).pdf>.

Murray, B.E., Singh, K.V., Heath, J.D., Sharma, B.R., Weinstock, G.M., 1990. Comparison of genomic DNAs of different enterococcal isolates using restriction endonucleases with infrequent recognition sites. J. Clin. Microbiol. 28, 2059–2062.

Parveen, S., Murphree, R.L., Edmiston, L., Kaspar, C.W., Portier, K.M., Tamplin, M.L., 1997. Association of multiple-antibiotic-resistance profiles with point and nonpoint sources of Escherichia coli in Apalachicola Bay. Appl. Environ. Microbiol. 63, 2607–2612.

Price, B., Venso, E.A., Frana, M.F., Greenberg, J., Ware, A., Currey, L., 2006. Classification tree method for microbial source tracking with antibiotic resistance analysis data. Appl. Environ. Microbiol. 72, 3468–3475.

Ritter, K.J., Carruthers, E., Carson, C.A., Ellender, R.D., Harwood, V.J., Kingsley, K., Nakatsu, C., Sadowsky, M., Shear, B., West, B., Whitlock, J.E., Wiggins, B.A., Wilbur, J.D., 2003. Assessment of statistical methods used in library-based approaches to microbial source tracking. J. Water Health 1 (4), 209–223.

Ringner, G., Peterson, C., Khan, J., 2002. Analyzing array data using supervised methods. Pharmacogenomics 3 (3), 403–415.

Ross, S., 2005. A First Course in Probability, seventh ed. Prentice-Hall, Englewood Cliffs, NJ.

Scott, T.M., Jenkins, T., Lukasik, J., Rose, J., 2005. Potential use of a host associated molecular marker in Enterococcus faecium as an index of human fecal pollution. Environ. Sci. Technol. 39, 283–287.

Simpson, J.M., Santo Domingo, J.W., Reasoner, D.J., 2002. Microbial source tracking: state of the science. Environ. Sci. Technol. 36, 5279–5288.

Stoeckel, D.M., Harwood, V.J., 2007. Performance, design, and analysis in microbial source tracking studies. Appl. Environ. Microbiol. 73, 2405–2415.

US Environmental Protection Agency, 1986. Ambient Water Quality Criteria for Bacteria - 1986. EPA-440/5-84-002.

US Environmental Protection Agency, 2005. Microbial source tracking guide document, EPA/600-R-05-064, June 2005. US Environmental Protection Agency, Washington, DC.

Wheeler, A.L., Hartel, P., Godfrey, D., Hill, J., Segars, W., 2002. Potential of Enterococcus faecalis as a human fecal indicator for microbial source tracking. J. Environ. Qual. 31, 1286–1293.

Whitlock, J.E., Jones, D.T., Harwood, V.J., 2002. Identification of the sources of fecal coliforms in an urban watershed using antibiotic resistance analysis. Water Res. 36, 4273–4282.

Wiggins, B.A., Andrews, R.W., Conway, R.A., Corr, C.L., Dobratz, E.J., Dougherty, D.P., Eppard, J.R., Knupp, S.R., Limjoco, M.C., Mettenburg, J.M., Rinehardt, J.M., Sonsino, J., Torrijos, R.L., Zimmerman, M.E., 1999. Use of antibiotic resistance analysis to identify nonpoint sources of fecal pollution. Appl. Environ. Microbiol. 65, 3483–3486.