Mismatch distribution analysis of Y-STR haplotypes as a tool
for the evaluation of identity-by-state proportions and
significance of matches—the European picture
Luísa Pereira a,*, Maria João Prata a,b, António Amorim a,b
a IPATIMUP (Instituto de Patologia e Imunologia Molecular da Universidade do Porto), R. Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
b Faculdade de Ciências da Universidade do Porto, Praça Gomes Teixeira, 4050 Porto, Portugal
Received 4 April 2002; received in revised form 15 September 2002; accepted 18 September 2002
Abstract
We suggest the use of the mismatch distribution methodology as an easy way to estimate the distance between all pairs of
haplotypes present in a sample. This approach allows the evaluation of the proportion of pairs of Y-STR haplotypes that are
prone to become identical by state (IBS), in one generation, by recurrent mutation, a statistic of major importance in the
forensic field. The mismatch approach has some advantages over the empirical one, since it does not require
simultaneous information on STRs and SNPs, and it also allows the evaluation of IBS within haplogroups. The
estimation of IBS at a European scale showed that there is high population substructuring for this parameter, which
increases from southern-central European countries towards the west and north, in accordance with what was found for
Y-biallelic markers. This result seems to call for a more careful use of large databases for matching evaluation, even
in the absence of population structure for general Y-STR diversity. Furthermore, mismatch distribution can be used to
measure the distance between a particular haplotype and all the haplotypes in a sample. When applied to the most
frequent haplotypes in Europe, it revealed that the opportunity for IBS is not directly related to the frequency of a
haplotype, but is highly dependent on the proportion of neighbouring haplotypes, so that reporting only the haplotype
frequency when evaluating the significance of a match can be misleading.
© 2002 Published by Elsevier Science Ireland Ltd.
Keywords: Mutation recurrence; Identity-by-state; Identity-by-descent; Mismatch distribution; Match
1. Introduction
Y-chromosome haplotype matching is the main tool used in forensics to evaluate the significance of
observing a particular haplotype in a certain sample. In order to have a large European reference set,
the “Y-STR Haplotype Reference Database” was created, allowing the search for matches at a
continental scale [1]. The application of matching at a continental scale relies on the assumption
that there is no significant population structure in Europe for the Y-STRs [2].
At present, the evaluation of the match is qualitative, i.e. based only on the dichotomy
equal/not-equal haplotypes, and the significance of a match/non-match has not been quantified. This
quantification requires (1) the estimation of haplotype frequency, and (2) the evaluation of
recurrence (i.e. the generation of identical haplotypes through mutation rather than shared ancestry).
With respect to the estimation of haplotype frequency, a Bayesian approach has recently been developed
and implemented in the Y-STR database website [2,3], which overcomes (a) the impossibility of applying
the “product rule” of allele frequencies to linked markers and (b) the simple counting of haplotype
occurrences in a database (which would require very large databases given the high number of unique
haplotypes). However, it cannot be overlooked that this is just the “best guess”, and a cautious use
of it has been suggested [3].
Forensic Science International 130 (2002) 147–155
Abbreviations: IBD, identical-by-descent; IBS, identical-by-state; Y-BMs, Y chromosome biallelic markers; Y-STR, Y chromosome short tandem repeat
* Corresponding author. Tel.: +351-22-5570700; fax: +351-22-5570799.
E-mail address: [email protected] (L. Pereira).
0379-0738/02/$ – see front matter © 2002 Published by Elsevier Science Ireland Ltd.
PII: S0379-0738(02)00371-7
Nevertheless, due to the high mutation rate of STRs, recurrence introduces an important bias through
the creation of haplotypes that are identical by state (IBS) rather than by descent (IBD). The
approach of De Knijff [4], which evaluates the proportion of these classes among identical haplotypes
defined by Y-biallelic markers (Y-BMs), was the only one used so far. Its basic assumption is that the
same Y-STR haplotype in two different Y-BM haplogroups points to IBS and not IBD. When a certain
slow-mutation event occurs, it appears on a certain Y-STR background, and from that point on, both
sister lineages will accumulate diversity in the fast-evolving polymorphisms, in proportion to the
time of divergence. So, the probability of haplotype sharing between both (a false identity by
descent) will become more and more negligible. Being multi-generational, this approach has serious
limitations, due to the mutational mechanism that generates STR diversity, which is generally accepted
to fit some variant of a basic step-wise model in which each allelic state most often gives rise to
new ones one repeat (+/−) away. So, it overlooks re-recurrence phenomena, and it does not take into
account intra-haplogroup homoplasy. De Knijff [4] used the combined Y-SNP/Y-STR approach to estimate
the proportion of IBS, and obtained values of 0.8% when studying 6 Y-STRs and 4 Y-BMs and 0.2% when
raising the number of Y-STRs to 8. Unfortunately, combined information for both Y-STRs and Y-BMs in
the same sample is very rare, and it has therefore been impossible to measure the proportion of IBS
at a large scale by this approach.
We suggest an approach based upon mismatch distribu-
tions/haplotype pairwise comparison that evaluates the pro-
portion of Y-STR haplotypes on which mutation can
originate IBS, in the next generation, and we report its
distribution at a European scale.
2. Material and methods
2.1. Sample
A total of 1477 different Y-STR haplotypes in 5529 individuals were collected (by 11 April 2001) from
the Y-STR Haplotype Reference Database (http://ystr.charite.de) for the populations displayed in
Table 1 (the Zeeland sample was not used in this study due to its small size and reduced diversity).
The Y-STRs considered for analysis were DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392 and DYS393,
and this order will be maintained below. In our calculations, the DYS389II number of repeats was
recorded as the difference between the overall repeat number of the amplicon (as registered in the
database) and the number of repeats at DYS389I. We exemplify the calculations with the population that
we called The Netherlands (n = 229), which corresponds to the sum of the samples Holland (n = 87),
Friesland (n = 44), Groningen (n = 48) and Limburg (n = 50) deposited in that database.
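The DYS389II recoding described above can be sketched as follows. This is only an illustration: the function name `to_analysis_haplotype` and the example database record (with an amplicon total of 29 at DYS389II) are assumptions, not taken from the database itself.

```python
# Locus order used throughout the paper.
LOCI = ("DYS19", "DYS389I", "DYS389II", "DYS390", "DYS391", "DYS392", "DYS393")


def to_analysis_haplotype(database_repeats):
    """Recode DYS389II as (amplicon total) - (DYS389I repeats).

    `database_repeats` is a sequence of seven repeat numbers in LOCI order,
    with DYS389II given as the overall repeat number of the amplicon.
    """
    repeats = list(database_repeats)
    if len(repeats) != len(LOCI):
        raise ValueError("expected one repeat number per locus")
    i_389i = LOCI.index("DYS389I")
    i_389ii = LOCI.index("DYS389II")
    repeats[i_389ii] = repeats[i_389ii] - repeats[i_389i]
    return tuple(repeats)


# Hypothetical database record: 13 repeats at DYS389I, 29 for the whole amplicon.
recoded = to_analysis_haplotype((14, 13, 29, 24, 11, 13, 13))
print("-".join(str(r) for r in recoded))
```

With this recoding, a change at DYS389I is not counted a second time at DYS389II when haplotypes are compared.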
2.2. Statistical analyses
Mismatch distributions were obtained using the ARLEQUIN 2.0 population genetics software [5]. Two
types of mismatch distribution can be obtained, considering either just the different allelic states
at each locus (irrespective of their number of repeat differences) or the number of repeat units at
which two haplotypes differ over all loci under consideration. In practice, this can be done using two
types of input files, respectively: (1) the “Microsat” input file, and (2) the “DNA” input file, where
each repeat unit was considered as an ambiguous position (N) and differences in the number of repeat
units were considered as insertion (N) or deletion (-).
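The two kinds of pairwise difference can be illustrated with a minimal sketch. It does not reproduce ARLEQUIN's implementation or file formats; it only assumes that the "Microsat" count corresponds to the number of loci with different alleles and the "DNA" count to the summed absolute difference in repeat number. The function names and the toy sample are hypothetical.

```python
from collections import Counter
from itertools import combinations


def loci_distance(a, b):
    """Number of loci at which two haplotypes carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


def repeat_distance(a, b):
    """Total number of repeat units separating two haplotypes over all loci."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mismatch_distribution(sample, distance):
    """Proportion of haplotype pairs falling in each difference class.

    `sample` lists one haplotype per individual, so repeated haplotypes
    contribute pairs to class 0.
    """
    counts = Counter(distance(a, b) for a, b in combinations(sample, 2))
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}


# Toy sample (not data from the paper): four individuals, three haplotypes.
toy_sample = [
    (14, 13, 16, 24, 11, 13, 13),
    (14, 13, 16, 24, 11, 13, 13),
    (14, 13, 16, 23, 11, 13, 13),
    (14, 12, 16, 22, 10, 11, 13),
]
by_loci = mismatch_distribution(toy_sample, loci_distance)
by_repeats = mismatch_distribution(toy_sample, repeat_distance)
print(by_loci)
print(by_repeats)
```

Classes 0 and 1 coincide in the two distributions only when every single-locus difference is exactly one repeat unit; any excess in the loci-based class 1 comes from pairs differing by more than one repeat at a single locus.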
To calculate the distance in repeat units of a particular haplotype to all the haplotypes of a
sample A, the mismatch distribution analysis can be applied by constructing an artificial sample B,
in which the haplotype under analysis is added to the original sample A (either as another instance
of a previous occurrence or as a new one). The distribution of the differences of that haplotype to
all the haplotypes observed in the sample can be obtained by subtracting, class by class, the number
of pairwise comparisons in the mismatch distribution of A from that of B.
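The subtraction of sample A from sample B can be sketched as below, working on raw pair counts. The helper names and the toy sample are hypothetical; the sketch only shows that the B minus A difference equals a direct comparison of the query haplotype with every member of A.

```python
from collections import Counter
from itertools import combinations


def repeat_distance(a, b):
    """Total number of repeat units separating two haplotypes over all loci."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pair_counts(sample):
    """Raw number of pairwise comparisons in each difference class."""
    return Counter(repeat_distance(a, b) for a, b in combinations(sample, 2))


def distances_to_sample(haplotype, sample_a):
    """Distances from `haplotype` to every member of A, obtained as B - A."""
    sample_b = list(sample_a) + [haplotype]
    # All pairs inside A cancel out; only pairs involving the query remain.
    return dict(pair_counts(sample_b) - pair_counts(sample_a))


# Toy sample A (not data from the paper) and a query haplotype.
sample_a = [
    (14, 13, 16, 24, 11, 13, 13),
    (14, 13, 16, 24, 11, 13, 13),
    (14, 13, 16, 23, 11, 13, 13),
    (14, 12, 16, 22, 10, 11, 13),
]
query = (14, 13, 16, 24, 10, 13, 13)

by_subtraction = distances_to_sample(query, sample_a)
direct = dict(Counter(repeat_distance(query, h) for h in sample_a))
print(by_subtraction)
```

Dividing each count by the size of A turns the result into the class proportions for that haplotype.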
3. Results and discussion
3.1. Mismatch distribution—estimation of the molecular
distance between pairs of haplotypes
The evaluation of matches in the simple dichotomic
framework of equal/not-equal haplotypes overlooks that
both categories are heterogeneous. In the first category,
we are including both IBD and IBS haplotypes, while in
the second, the molecular distance between haplotypes is not
taken into account, thus giving the same weight to differ-
ences in one/several repeat units at one/several loci.
Mismatch distributions are histograms representing, for a set of sequences or haplotypes, the
frequency of pairwise comparisons having 0, 1, ..., $n_{\max}$ differences, where $n_{\max}$ is the
number of differences between the most different haplotypes. In population genetics they have been
used as a way to represent graphically the amount of diversity in a sample. If the distribution is
L-shaped, many sequences are identical and far fewer comparisons involve genetically different
sequences. If it is bell-shaped, with mean k, most pairs differ by roughly the same amount k.
Moreover, it has been shown that the shape of mismatch distributions is highly influenced by ancient
demographic events and selective sweeps [6,7]. In that way, a mismatch distribution is an easy way of
obtaining the molecular distance between all pairs of haplotypes in a certain population. This
strategy overcomes the difficulty of evaluating the molecular proximity between haplotypes: visually
it is impossible to grasp the multidimensional network that connects all the haplotypes in a sample
(for instance, placing each locus on an axis, it would be necessary to define as many dimensions as
the number of loci considered for the haplotype; as pointed out by Krawczak, personal communication),
which makes Y-STR networks impossible to interpret [8].
Table 1
Y-STR diversity and proportions (%) of the haplotype pool prone to become different-by-state (DBS) and identical-by-state (IBS) in Europe
Population            N     No. haplotypes  Gene diversity ± SE  Class 0 (%)  % pool prone to DBS  Class 1 (%)  % pool prone to IBS
Northern Portugal     182   105             0.982 ± 0.004        1.85         0.082                6.70         0.020
Spain                 365   208             0.983 ± 0.003        1.69         0.075                5.77         0.018
  Asturias            90    65              0.983 ± 0.007        1.72         0.076                6.37         0.019
  Galicia             103   71              0.980 ± 0.007        1.96         0.087                5.41         0.016
  Granada             52    43              0.989 ± 0.008        1.13         0.050                6.11         0.019
  Zaragoza            120   86              0.984 ± 0.006        1.60         0.071                5.78         0.018
Italy                 553   342             0.991 ± 0.002        0.92         0.041                3.02         0.009
  Modena              99    70              0.977 ± 0.009        2.27         0.101                6.56         0.020
  NWItaly             131   98              0.983 ± 0.007        1.71         0.076                3.69         0.011
  Rome                216   161             0.995 ± 0.002        0.53         0.024                2.38         0.007
  Toscany             107   87              0.993 ± 0.003        0.74         0.033                2.56         0.008
Innsbruck (Austria)   135   94              0.990 ± 0.003        0.98         0.043                3.45         0.010
Budapest (Hungary)    117   94              0.996 ± 0.002        0.43         0.019                2.18         0.007
Switzerland           199   141             0.989 ± 0.003        1.11         0.049                4.26         0.013
  Bern                91    68              0.990 ± 0.004        1.03         0.046                4.15         0.013
  Lausanne            108   86              0.989 ± 0.004        1.09         0.048                4.29         0.013
Germany               2096  748             0.992 ± 0.001        0.83         0.037                3.18         0.010
  Berlin              429   254             0.992 ± 0.001        0.80         0.036                2.95         0.009
  Cologne             135   98              0.989 ± 0.004        1.07         0.047                3.57         0.011
  Dusseldorf          150   98              0.984 ± 0.004        1.63         0.072                5.32         0.016
  Freiburg            211   129             0.987 ± 0.003        1.35         0.060                4.84         0.015
  Hamburg             114   101             0.997 ± 0.002        0.26         0.012                1.74         0.005
  Leipzig             405   236             0.995 ± 0.001        0.49         0.022                2.07         0.006
  Magdburg            177   120             0.991 ± 0.002        0.91         0.040                3.27         0.010
  Mainz               104   75              0.990 ± 0.003        1.03         0.046                3.68         0.011
  Munich              251   156             0.989 ± 0.002        1.14         0.051                3.74         0.011
  Munster             58    42              0.981 ± 0.010        1.94         0.086                6.17         0.019
  Stuttgart           61    50              0.991 ± 0.005        1.03         0.046                4.77         0.015
Belgium               97    70              0.985 ± 0.006        1.55         0.069                5.61         0.017
The Netherlands       229   123             0.980 ± 0.004        2.05         0.091                6.38         0.019
  Friesland           44    34              0.985 ± 0.009        1.48         0.066                6.66         0.020
  Groningen           48    30              0.963 ± 0.015        2.75         0.122                4.58         0.014
  Holland             87    64              0.983 ± 0.006        1.74         0.077                5.91         0.018
  Limburg             50    33              0.972 ± 0.011        2.78         0.123                7.92         0.024
Poland                596   203             0.985 ± 0.002        1.51         0.067                5.83         0.018
  Bydgoszcs           168   103             0.987 ± 0.003        1.26         0.056                5.69         0.017
  Northern Poland     150   92              0.985 ± 0.004        1.46         0.065                5.56         0.017
  Warsaw              157   97              0.985 ± 0.004        1.52         0.067                5.19         0.016
  Wroclaw             121   75              0.983 ± 0.005        1.75         0.078                7.34         0.022
Estonia               133   93              0.987 ± 0.004        1.31         0.058                3.68         0.011
Latvia                145   100             0.991 ± 0.003        0.93         0.041                4.53         0.014
Lithuania             151   100             0.988 ± 0.003        1.17         0.052                4.73         0.014
Moscou (Russia)       85    55              0.978 ± 0.008        2.24         0.099                6.44         0.020
Norway                300   164             0.986 ± 0.003        1.41         0.063                3.75         0.011
  Norway Central      48    39              0.986 ± 0.009        1.42         0.063                3.63         0.011
  Norway East         85    57              0.982 ± 0.006        1.82         0.081                5.66         0.017
  Norway North        45    39              0.992 ± 0.007        0.81         0.036                3.54         0.011
  Norway Oslo         33    26              0.983 ± 0.012        1.70         0.075                3.41         0.010
  Norway South        25    23              0.990 ± 0.016        1.00         0.044                2.67         0.008
  Norway West         64    53              0.991 ± 0.006        0.89         0.039                3.32         0.010
Buenos Aires          100   76              0.988 ± 0.005        1.23         0.055                2.87         0.009
Class 0: equal haplotypes; class 1: haplotypes differing in one repeat unit. SE: standard error. Indented rows are regional samples of the population listed above them.
Fig. 1 represents both types of mismatch distributions in The Netherlands. It is noteworthy that
86.8% (6.38%/7.35%) of the differences between haplotypes differing at one locus consist of a single
repeat unit.
In the observed mismatch distribution there are two
classes that are important when considering the dichotomy
equal/not-equal, namely: (a) class 0, for the proportion of
identical haplotypes, and (b) class 1, for the proportion of
haplotypes differing by a single step. Since mutation in STRs
occurs mainly by affecting one repeat unit each time, it will
be upon these two classes that mutation will act in the next
generation, creating haplotypes different-by-state (DBS) and
IBS, respectively.
3.2. Estimation of IBS and DBS
The probability of a certain Y-STR haplotype becoming IBS to another one, in the next generation,
arises from single-step mutation in pairs of haplotypes differing by one repeat unit at a single
locus. Class 1 of the “DNA” mismatch distribution gives the proportion of pairs of haplotypes that
differ by one repeat unit at a single locus, and hence those that are prone to become IBS in the next
generation.
The probability of the class 1 to class 0 transition (i.e. of two Y-STR haplotypes becoming IBS) per
generation can be roughly estimated as $\mu(1-\mu)^{2n-1} f_1$, where $n$ is the number of loci
defining the haplotype (seven in our case), $\mu$ the average mutation rate ($3.17 \times 10^{-3}$)
[9], and $f_1$ the frequency of class 1. The rationale behind the formula is briefly the following.
Inside class 1 there are $n$ different subclasses corresponding to each of the cases: haplotypes
differing by just one repeat at the first locus, at the second, and so on. For simplicity's sake we
will assume that all loci are equally diverse (but it can easily be demonstrated that this assumption
is not essential), so that class 1 is divisible into $n$ equally frequent subclasses ($f_1/n$).
Considering now the first locus, for the haplotype pair there is a chance $\mu/2$ of the longer one
losing one repeat (while mutation must not occur in the other member of the pair, an event with
probability $1-\mu$) and the same probability of the shorter one gaining it (not considering the
slight bias that seems to favour the gain [9]). Therefore, the overall probability can be estimated as
$(\mu/2)(1-\mu) + (1-\mu)(\mu/2) = \mu(1-\mu)$.
The transition probability for just one locus will thus be given by $\mu(1-\mu) f_1/n$. Analysing now
two loci, there will be two subclasses to consider: those pairs differing in the first (but not in the
second) locus and vice versa, each with frequency $f_1/n$. Let us calculate the probability for the
first kind of pair: the only events that would lead to IBS require that mutation occurs (in the right
direction, as above) in just one of the members of the pair (which now encompasses four alleles, two
in each haplotype), so $\mu(1-\mu)^{3}$. Adding the other kind of pair, the transition probability for
two loci would be $2\mu(1-\mu)^{3} f_1/n$. Thus, generalising to $n$ loci it becomes
$n\mu(1-\mu)^{2n-1} f_1/n = \mu(1-\mu)^{2n-1} f_1$. For The Netherlands the corresponding value is
0.0194%.
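As a quick numerical check of the formula above, the following sketch evaluates the class 1 to class 0 transition probability for The Netherlands from the class 1 value in Table 1. The function name `ibs_probability` is hypothetical; the frequency is passed in percent, so the result is also in percent.

```python
MU = 3.17e-3  # average single-step mutation rate per locus per generation [9]
N_LOCI = 7    # number of Y-STR loci defining the haplotype


def ibs_probability(f1, mu=MU, n_loci=N_LOCI):
    """mu * (1 - mu)^(2n - 1) * f1: one allele of the pair mutates in the
    right direction while the other 2n - 1 alleles stay unchanged."""
    return mu * (1 - mu) ** (2 * n_loci - 1) * f1


# Class 1 for The Netherlands is 6.38% of all haplotype pairs (Table 1).
netherlands_ibs = ibs_probability(6.38)
print(f"{netherlands_ibs:.4f}%")  # 0.0194%
```

The same function applied to the other class 1 values of Table 1 gives the corresponding "pool prone to IBS" entries, for example about 0.007% for Budapest (class 1 = 2.18%).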
There are more complex and rarer phenomena contributing to IBS, but they are comparatively
insignificant. For instance, the probability of IBS resulting from mutations involving two repeat
units in one locus would be $\mu_2(1-\mu_2)^{2n-1} f_{1i}$, where $\mu_2$ is the average mutation rate
for two repeat units in one locus ($3.17 \times 10^{-4}$, since this kind of mutation is roughly 10
times less frequent than the single step) [9], and $f_{1i}$ the frequency of pairs of haplotypes
differing by two steps at one locus. An overestimate of $f_{1i}$ can be obtained from the difference
between the values observed for class 1 in the two mismatch distributions, 0.97%, which in fact
corresponds to the proportion of haplotypes differing by more than one repeat unit at one locus. This
gives a value of 0.000306%, which is roughly 60 times smaller than the probability calculated
above.
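The two-step contribution can be checked in the same way. This is a sketch under the stated values only: the class 1 figures are those quoted for The Netherlands (7.35% for the loci-based distribution, 6.38% for the repeat-unit one), and the variable names are hypothetical.

```python
MU = 3.17e-3    # single-step mutation rate [9]
MU2 = 3.17e-4   # two-step mutation rate, taken as one tenth of MU
N_LOCI = 7

# Overestimate of f_1i: pairs differing at one locus (7.35%) minus pairs
# differing by exactly one repeat unit (6.38%), both in percent.
f1i = 7.35 - 6.38

two_step_ibs = MU2 * (1 - MU2) ** (2 * N_LOCI - 1) * f1i
single_step_ibs = MU * (1 - MU) ** (2 * N_LOCI - 1) * 6.38

print(f"two-step:    {two_step_ibs:.6f}%")
print(f"single-step: {single_step_ibs:.4f}%")
print(f"ratio:       {single_step_ibs / two_step_ibs:.0f}")
```

The ratio of the two probabilities comes out between 60 and 65, so the two-step pathway adds less than 2% to the single-step estimate.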
The probability of becoming DBS can be approximated by the formula $2n\mu f_0$, where $f_0$ is the
frequency of class 0. Indeed, when comparing a pair of equal haplotypes, any mutation, either losing
or gaining repeat units, at any locus, will make the pair different. On the other hand, the
probability of occurrence of the same mutation in both haplotypes ($\mu^2$) is nearly zero. The
corresponding value for The Netherlands would be 0.0910%.
Fig. 1. Mismatch distributions for the number of loci (A) and number of repeat unit (B) pairwise differences for Y-STRs in The Netherlands sample.
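The DBS approximation can be verified against Table 1 with a short sketch. The function name `dbs_probability` is hypothetical; class 0 frequencies are passed in percent, so results are in percent.

```python
MU = 3.17e-3  # average mutation rate per locus per generation [9]
N_LOCI = 7


def dbs_probability(f0, mu=MU, n_loci=N_LOCI):
    """2 * n * mu * f0: a mutation at any of the 2n alleles of an
    identical pair makes the two haplotypes different."""
    return 2 * n_loci * mu * f0


# Class 0 values (%) taken from Table 1.
class0 = {"The Netherlands": 2.05, "Hamburg": 0.26, "Groningen": 2.75}
for population, f0 in class0.items():
    print(f"{population}: {dbs_probability(f0):.3f}%")
```

These values agree with the "pool prone to DBS" column of Table 1 (0.091, 0.012 and 0.122, respectively).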
3.3. Estimation of IBS, in the next generation, at a
European scale
The results for the proportion of the Y-STR pool that is
prone to become IBS, in the next generation, at a European
scale are presented in Table 1.
There is an apparent westward and northward gradient showing an increase in the propensity for IBS,
in the next generation, from a focus in southern-central Europe (Fig. 2). Hungary and Italy display
the lowest values, around one half of those observed in the more extreme western and northern regions.
These gradients are concordant, as expected, with the distribution of gene diversity values, not only
for Y-STRs (Table 1) but also for Y-BMs (Fig. 2), described for a similar area by Rosser et al. [10].
Southern-central Europe is located in an area of convergence of differently oriented Y-BM clines, from
the southeast into the northwest [11]. So, a lower probability of haplotypes becoming IBS within
haplogroups is expected there, since the number of individuals inside each haplogroup is also low; in
contrast, the propensity for IBS between haplogroups is predictably higher, since there is a larger
number of haplogroups.
3.4. Mismatch distribution as a tool to measure the
distance of a particular haplotype to the database
As already pointed out, there is a great bias when analysing a match in the classical fashion, by
checking in the database whether there is a match or not, and if so reporting the haplotype frequency.
This method disregards the fact that the particular haplotype may be molecularly close to one or
several other haplotypes, and hence prone, by mutation, to become IBS or DBS. By mismatch distribution
analysis, it is possible to calculate the molecular distance between one particular haplotype and all
the remaining haplotypes in a certain population.
Fig. 2. Values of Y-STR IBS (in %) and Y-BM diversity (italic) at a European scale.
Fig. 3. Mismatch distribution for the differences in repeat units between each of the Y-STR haplotypes belonging to cluster 1 (A) and to cluster 2 (B), respectively, and all the haplotypes in Europe.
We applied this strategy to the European database. Given the mutational mechanism generating STR
diversity, the haplotype with the most molecularly close haplotypes is expected to be the most
frequent one. So let us suppose that we are investigating a match for the haplotype
14-13-16-24-11-13-13 (E0451), which occurs in the European sample with a frequency of 5.14%.
Calculating the molecular distance to all the haplotypes in the European sample (Fig. 3A and Table 2),
we observe that a substantial proportion (11.03%) of haplotypes differ from it by only one step,
making the probability of IBS or DBS very high. Let us now look at the properties of its closest
haplotypes, that is, all the haplotypes that are its one-step neighbours (which, together with E0451,
will be called cluster 1, globally reaching a frequency of 16.17%): all show a high frequency for
class 1, and so a significant propensity for IBS. We extended the same analysis to the one-step
neighbours of haplotypes 14-13-16-23-11-13-13 (E0414) and 14-13-16-24-10-13-13 (E0435), which are
respectively the second and third most frequent European haplotypes, and which are one-step neighbours
of E0451 (those clusters will be called cluster 1.a and 1.b, respectively). Summing clusters 1, 1.a
and 1.b, a frequency of 23.57% is obtained (Table 2), which is surely the main reason for the first
mode in the bimodal distribution displayed by almost all the haplotypes of cluster 1 (Fig. 3A) and
clusters 1.a and 1.b (not shown), while the second mode represents the pairwise comparisons with the
remaining haplotypes in the sample.
Besides those haplotypes, the fourth most frequent one is 14-12-16-22-10-11-13 (E0216; frequency of
2.30% in Europe), which is six steps away from E0451 (Table 3). Together with its one-step neighbours
(cluster 2), it accounts for only 6.20% of the European sample. The main point is that, when comparing
haplotypes of comparable frequency from cluster 1 and cluster 2, the proportion of class 1 is very
different between them (e.g. E0435 of cluster 1, with a frequency of 2.44%, has 9.75% for class 1,
while E0216 of cluster 2, with a frequency of 2.30%, has 3.91% for class 1). This shows how biased it
is to report only the frequency of the particular haplotype, disregarding the proportion of
neighbouring haplotypes in the sample.
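The class 1 value of a single haplotype can be reproduced by enumerating its one-step neighbours and summing their counts in the sample. The sketch below does this for E0216 using the counts listed for its neighbours in Table 3 and the total of 5529 individuals; the function names are hypothetical, and the reading that class 1 equals the pooled frequency of the one-step neighbours is an interpretation that happens to match the tabulated 3.91%.

```python
TOTAL_INDIVIDUALS = 5529  # size of the European sample


def parse(haplotype):
    """Turn '14-12-16-22-10-11-13' into a tuple of repeat numbers."""
    return tuple(int(r) for r in haplotype.split("-"))


def one_step_neighbours(haplotype):
    """Yield the 2n haplotypes one repeat unit away at exactly one locus."""
    for locus in range(len(haplotype)):
        for step in (-1, 1):
            neighbour = list(haplotype)
            neighbour[locus] += step
            yield tuple(neighbour)


def class1_percent(haplotype, counts, total=TOTAL_INDIVIDUALS):
    """Share (%) of the sample carrying a one-step neighbour of `haplotype`."""
    hits = sum(counts.get(nb, 0) for nb in one_step_neighbours(haplotype))
    return 100 * hits / total


# Counts of the observed one-step neighbours of E0216 (Table 3).
counts = {parse(h): n for h, n in {
    "14-12-16-23-10-11-13": 94, "15-12-16-22-10-11-13": 31,
    "14-12-16-22-10-11-14": 21, "14-13-16-22-10-11-13": 18,
    "14-12-17-22-10-11-13": 17, "14-12-16-22-11-11-13": 10,
    "14-12-16-22-10-11-12": 8,  "14-12-15-22-10-11-13": 5,
    "14-12-16-21-10-11-13": 4,  "14-11-16-22-10-11-13": 3,
    "14-12-16-22-10-12-13": 2,  "14-12-16-22-10-10-13": 2,
    "13-12-16-22-10-11-13": 1,
}.items()}

e0216 = parse("14-12-16-22-10-11-13")
print(f"{class1_percent(e0216, counts):.2f}%")  # 3.91%
```

With seven loci every haplotype has 14 possible one-step neighbours; for E0216, 13 of them are observed in the European sample and one (NO1 in Table 3) is not.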
Notice that mismatch distributions for clusters 2 and 2.a
(haplotype E0230 and its one-step neighbours) are unimodal
(Fig. 3B; not shown for cluster 2.a). This pattern is similar to
the one displayed by the overall haplotype pairwise com-
parison in the European sample (Fig. 4). It seems that the
bimodal shape displayed by the main class of haplotypes in
Europe resembles the bimodal pattern displayed by the Y-
chromosome biallelic markers [11], but the overall Y-STR
European sample is already unimodal, probably due to the
high mutation rate in STRs.
Although, as stressed before, the overall Y-STR diversity is not structured in Europe [2], once again
there seems to be a non-random geographical pattern for the distribution of cluster 1 in Europe
(Fig. 5A). It is clearly less frequent in the eastern portion of the continent and seems to increase
in frequency from south-central Europe towards the west and north, resembling the pattern just
described for the propensity to IBS. Cluster 2 (Fig. 5B) is more frequent in north-central Europe, and
decreases towards the west and east. The comparison of the diversities of the two clusters suggests
that cluster 1 is more ancient than cluster 2.
Table 2
Frequencies of a particular Y-STR haplotype (class 0) and its one-step neighbours (class 1) inside clusters 1, 1.a and 1.b in the European sample
ID     Haplotype             N    Class 0 (%)  Class 1 (%)
E0451  14-13-16-24-11-13-13  284  5.14         11.03
E0414  14-13-16-23-11-13-13  149  2.69         9.26
E0435  14-13-16-24-10-13-13  135  2.44         9.75
E0640  14-14-16-24-11-13-13  64   1.16         7.22
E0530  14-13-17-24-11-13-13  50   0.90         7.16
E0475  14-13-16-25-11-13-13  46   0.83         6.53
E0908  15-13-16-24-11-13-13  35   0.63         6.85
E0450  14-13-16-24-11-13-12  32   0.58         6.26
E0256  14-12-16-24-11-13-13  19   0.34         6.02
E0362  14-13-15-24-11-13-13  16   0.29         6.37
E0452  14-13-16-24-11-13-14  16   0.29         5.81
E0446  14-13-16-24-11-12-13  16   0.29         5.77
E0460  14-13-16-24-12-13-13  12   0.21         5.53
E0454  14-13-16-24-11-14-13  11   0.20         5.91
E0069  13-13-16-24-11-13-13  9    0.16         5.39
Cluster 1: N = 894, 16.17%
E0403  14-13-16-23-10-13-13  74   1.34         6.53
E0621  14-14-16-23-11-13-13  33   0.60         4.43
E0349  14-13-15-23-11-13-13  26   0.47         3.64
E0508  14-13-17-23-11-13-13  19   0.34         4.58
E0883  15-13-16-23-11-13-13  17   0.31         3.73
E0416  14-13-16-23-11-14-13  13   0.24         3.18
E0241  14-12-16-23-11-13-13  10   0.18         3.42
E0420  14-13-16-23-12-13-13  9    0.16         3.09
E0411  14-13-16-23-11-12-13  8    0.14         3.18
E0413  14-13-16-23-11-13-12  7    0.13         3.60
E0389  14-13-16-22-11-13-13  7    0.13         3.07
E0415  14-13-16-23-11-13-14  4    0.07         3.36
E0061  13-13-16-23-11-13-13  1    0.02         2.95
Cluster 1.a (without E0451 and E0414): N = 228, 4.12%
E0521  14-13-17-24-10-13-13  30   0.54         4.32
E0468  14-13-16-25-10-13-13  27   0.49         3.85
E0901  15-13-16-24-10-13-13  25   0.45         3.60
E0632  14-14-16-24-10-13-13  21   0.38         4.27
E0434  14-13-16-24-10-13-12  17   0.31         3.53
E0357  14-13-15-24-10-13-13  15   0.27         3.27
E0250  14-12-16-24-10-13-13  13   0.24         3.17
E0439  14-13-16-24-10-14-13  11   0.20         3.04
E0436  14-13-16-24-10-13-14  10   0.18         3.00
E0432  14-13-16-24-10-12-13  8    0.14         3.07
E0067  13-13-16-24-10-13-13  3    0.05         2.82
E0424  14-13-16-24-11-13-13  1    0.02         2.51
Cluster 1.b (without E0435, E0451, E0414 and E0403): N = 181, 3.27%
Total: N = 1303, 23.57%
For the definition of the clusters see text.
4. Final considerations
STRs have been the markers of choice in forensics due to their high polymorphism. SNPs, although
highly stable, are much less polymorphic, and so are less informative in the forensic field as long as
more advanced technologies allowing their easy and fast typing are not available. So, the extensive
forensic databases available for matching are now (and will remain for some years) exclusively based
on STRs.
It seems, therefore, that the approach presented here can be useful in the forensic field for the
evaluation of the significance and evidential value of Y-STR haplotype matches. This is best
illustrated by comparing the statistics of haplotypes in distinct situations in terms of abundance of
neighbouring haplotypes. For that we take as an example two non-observed haplotypes,
14-12-16-22-09-11-13 and 19-15-15-25-10-11-13, which were chosen because they lie, respectively, next
to cluster 2 and more than one step away from any observed haplotype. While their frequencies as trace
haplotypes, estimated by the surveying method, are $3.0 \times 10^{-4}$ and $1.4 \times 10^{-4}$,
respectively, the proportions of the sample from which they could arise in the next generation by
one-step mutation (class 1) are 2.33% and 0%, respectively.
This approach to IBS evaluation rests on different assumptions from the one based on joint
Y-STR/Y-BM information. The former is limited to one generation while the latter is multigenerational;
but while the latter has to assume that IBS occurs only between haplogroups, the former considers IBS
both within and between haplogroups, although it cannot distinguish them (unless we also have access
to Y-BM information). The comparison between the two values obtained for the Netherlands population
(0.0194% for the 7 Y-STR approach and 0.2% for 8 Y-STR/4 Y-BM) shows that roughly 1/10 of the observed
IBS is attributable to one-generation events. This finding is particularly important in the forensic
field, where we are interested in recent ancestry (kinship 0 in criminalistics or 1 in paternity), and
pairs of haplotypes will be more similar within haplogroups, and so more prone to IBS, than pairs of
haplotypes between haplogroups.
A main issue is the use of large-scale databases for broad matching surveys, whose legitimacy is
based on the claim that there is no significant population substructuring for STRs in Europe (contrary
to SNPs). Here, we demonstrate that this claim is at least debatable, since the opportunity for IBS
varies considerably across Europe, as judged from the mismatch distributions, and in a way that is
congruent with the well-defined gradients of SNP haplogroups: higher SNP diversity in south-central
Europe associated with a lower proportion of haplotype pairs prone to IBS, and decreasing SNP
diversity towards the west and north associated with a higher risk of recurrence. This is also in
agreement with the findings of Chikhi et al. [12], who have shown the existence of significant clines
across Europe for microsatellites and minisatellites. So, even if there is no population structure for
Y-STR diversity at a European scale, the same cannot be said for the propensity for IBS.
Table 3
Frequencies of a particular Y-STR haplotype (class 0) and its one-step neighbours (class 1) inside clusters 2 and 2.a in the European sample
ID     Haplotype             N    Class 0 (%)  Class 1 (%)
E0216  14-12-16-22-10-11-13  127  2.30         3.91
E0230  14-12-16-23-10-11-13  94   1.70         3.67
E0730  15-12-16-22-10-11-13  31   0.56         3.78
E0217  14-12-16-22-10-11-14  21   0.38         2.98
E0381  14-13-16-22-10-11-13  18   0.33         2.95
E0272  14-12-17-22-10-11-13  17   0.31         3.69
E0223  14-12-16-22-11-11-13  10   0.18         2.66
E0215  14-12-16-22-10-11-12  8    0.14         2.60
E0195  14-12-15-22-10-11-13  5    0.09         2.41
E0207  14-12-16-21-10-11-13  4    0.07         2.39
E0189  14-11-16-22-10-11-13  3    0.05         2.32
E0219  14-12-16-22-10-12-13  2    0.04         2.53
E0212  14-12-16-22-10-10-13  2    0.04         2.35
E0007  13-12-16-22-10-11-13  1    0.02         2.35
NO1    14-12-16-22-09-11-13  0    0            2.33
Cluster 2: N = 343, 6.20%
E0280  14-12-17-23-10-11-13  24   0.43         2.56
E0396  14-13-16-23-10-11-13  14   0.25         3.07
E0741  15-12-16-23-10-11-13  13   0.24         2.75
E0229  14-12-16-23-10-11-12  7    0.13         2.41
E0246  14-12-16-24-10-11-13  7    0.13         2.08
E0231  14-12-16-23-10-11-14  4    0.07         2.24
E0237  14-12-16-23-11-11-13  3    0.05         2.08
E0233  14-12-16-23-10-12-13  1    0.02         1.95
E0199  14-12-15-23-10-11-13  1    0.02         1.92
E0190  14-11-16-23-10-11-13  1    0.02         1.77
E0228  14-12-16-23-09-11-13  1    0.02         1.75
NO2    13-12-16-23-10-11-13  0    0            1.81
NO3    14-12-16-23-10-10-13  0    0            1.74
Cluster 2.a (without E0230 and E0216): N = 76, 1.37%
Total: N = 419, 7.57%
NO: not observed.
Fig. 4. Mismatch distribution for the differences in repeat units for the Y-STR haplotypes in Europe.
The use of mismatch distributions to measure the molecular distance between a particular haplotype
and all the haplotypes in a sample showed, as expected, that the propensity for IBS is largely
dependent on the proportion of neighbouring haplotypes. So, evaluating a match only in terms of
haplotype frequency can give a very biased estimate, although, of course, the cloud of related
haplotypes is dependent on the frequency. A much more informative estimate could be given in the form
of how frequently one-step neighbour haplotypes are present in the sample.
Acknowledgements
This work was partially supported by a research grant (PRAXIS XXI BD/13632/97) from Fundação para a
Ciência e a Tecnologia and IPATIMUP by Programa Operacional Ciência, Tecnologia e Inovação (POCTI),
Quadro Comunitário de Apoio III. We wish to thank Mark Beaumont, Lounès Chikhi and Richard Nichols for
comments on earlier versions of the manuscript.
Fig. 5. Frequency distribution of the Y-STR haplotypes belonging to clusters 1 (A) and 2 (B) in Europe.
References
[1] L. Roewer, M. Krawczak, S. Willuweit, M. Nagy, C. Alves,
A. Amorim, K. Anslinger, C. Augustin, A. Betz, E. Bosch, A.
Caglia, A. Carracedo, D. Corach, A.F. Dekairelle, T. Dobosz,
B.M. Dupuy, S. Furedi, C. Gehrig, L. Gusmao, J. Henke, L.
Henke, M. Hidding, C. Hohoff, B. Hoste, M.A. Jobling, H.J.
Kargel, P. De Knijff, R. Lessig, E. Liebeherr, M. Lorente, B.
Martinez-Jarreta, P. Nievas, M. Nowak, W. Parson, V.L.
Pascali, G. Penacino, R. Ploski, B. Rolf, A. Sala, U. Schmidt,
C. Schmitt, P.M. Schneider, R. Szibor, J. Teifel-Greding, M.
Kayser, Online reference database of European Y-chromoso-
mal short tandem repeat (STR) haplotypes, Forensic Sci. Int.
118 (2001) 106–113.
[2] L. Roewer, M. Kayser, P. De Knijff, K. Anslinger, A. Betz, A.
Caglia, D. Corach, S. Furedi, L. Henke, M. Hidding, H.J.
Kargel, R. Lessig, M. Nagy, V.L. Pascali, W. Parson, B. Rolf,
C. Schmitt, R. Szibor, J. Teifel-Greding, M. Krawczak, A new
method for the evaluation of matches in non-recombining
genomes: application to Y-chromosomal short tandem repeat
(STR) haplotypes in European males, Forensic Sci. Int. 114
(2000) 31–43.
[3] M. Krawczak, Forensic evaluation of Y-STR haplotype
matches: a comment, Forensic Sci. Int. 118 (2001) 114–115.
[4] P. De Knijff, Y chromosome shared by descent or by state, in:
C. Renfrew, K. Boyle (Eds.), Archaeogenetics: DNA and the
Population Prehistory of Europe, McDonald Institute Mono-
graphs, Oxbow Books, Cambridge, 2000, pp. 301–304.
[5] S. Schneider, D. Roessli, L. Excoffier, Arlequin ver. 2.0: a
software for population genetic data analysis, Genetics and
Biometry Laboratory, University of Geneva, Switzerland, 2000.
[6] M. Slatkin, R.R. Hudson, Pairwise comparisons of mitochon-
drial DNA sequences in stable and exponentially growing
populations, Genetics 129 (1991) 555–562.
[7] A.R. Rogers, H. Harpending, Population growth makes waves
in the distribution of pairwise genetic differences, Mol. Biol.
Evol. 9 (1992) 552–569.
[8] L. Roewer, M. Kayser, P. Dieltjes, M. Nagy, E. Bakker, M.
Krawczak, P. De Knijff, Analysis of molecular variance
(AMOVA) of Y-chromosome-specific microsatellites in two
closely related human populations, Hum. Mol. Genet. 5
(1997) 1029–1033.
[9] M. Kayser, L. Roewer, M. Hedman, L. Henke, J. Henke, S.
Brauer, C. Kruger, M. Krawczak, M. Nagy, T. Dobosz, R.
Szibor, P. De Knijff, M. Stoneking, A. Sajantila, Character-
istics and frequency of germline mutations at microsatellite
loci from the human Y chromosome, as revealed by direct
observation in father/son pairs, Am. J. Hum. Genet. 66 (2000)
1580–1588.
[10] Z.H. Rosser, T. Zerjal, M.E. Hurles, M.A. Adojaan, D.
Alavantic, A. Amorim, W. Amos, M. Armenteros, E. Arroyo,
G. Barbujani, L. Beckman, J. Bertranpetit, E. Bosch, D.G.
Bradley, G. Brede, G.C. Cooper, H.B.S.M. Corte-Real, P. De
Knijff, R. Decorte, Y.E. Dubrova, O. Evgrafov, A. Gilissen,
S. Glisic, M. Golge, E.W. Hill, A. Jeziorowska, L.
Kalaydjieva, M. Kayser, S.A. Kravchenco, J. Lavinha, L.A.
Livshits, S. Maria, K. McElreavey, T.A. Meitinger, B.
Melegh, R.J. Mitchell, J. Nicholson, S. Nørby, A. Novelletto,
A. Pandya, J. Parik, P.C. Patsalis, L. Pereira, B. Peterlin, G.
Pielberg, M.J. Prata, C. Previdere, K. Rajczy, L. Roewer, S.
Rootsi, D.C. Rubinsztein, J. Saillard, F.R. Santos, M.
Shlumukova, G. Stefanescu, B.C. Sykes, A. Tolun, R.
Villems, C. Tyler-Smith, M.A. Jobling, Y-chromosomal
diversity within Europe is clinal and influenced primarily
by geography rather than language, Am. J. Hum. Genet. 67
(2000) 1526–1543.
[11] L. Pereira, I. Dupanloup, Z.H. Rosser, M.A. Jobling, G.
Barbujani, Y-chromosome mismatch distributions in Europe,
Mol. Biol. Evol. 18 (2001) 1259–1271.
[12] L. Chikhi, G. Destro-Bisol, V. Pascali, V. Baravelli, M.
Dobosz, G. Barbujani, Clinal variation in the DNA of
Europeans, Hum. Biol. 70 (1998) 643–657.