
Mismatch distribution analysis of Y-STR haplotypes as a tool

for the evaluation of identity-by-state proportions and

significance of matches—the European picture

Luísa Pereira a,*, Maria João Prata a,b, António Amorim a,b

a IPATIMUP (Instituto de Patologia e Imunologia Molecular da Universidade do Porto), R. Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal

b Faculdade de Ciências da Universidade do Porto, Praça Gomes Teixeira, 4050 Porto, Portugal

Received 4 April 2002; received in revised form 15 September 2002; accepted 18 September 2002

Abstract

We suggest the use of the mismatch distribution methodology as an easy way to estimate the distance between all pairs of haplotypes present in a sample. This approach allows the evaluation of the proportion of pairs of Y-STR haplotypes that are prone to become identical by state (IBS), in one generation, by recurrent mutation, a statistic of major importance in the forensic field. The mismatch approach has some advantages over the empirical one, since it does not require simultaneous information on STRs and SNPs, and it also allows the evaluation of IBS within haplogroups. The estimation of IBS at a European scale showed that there is high population substructuring for this parameter, which increases from southern-central European countries towards the west and north, in accordance with what was found for Y-biallelic markers. This result seems to call for a more careful use of large databases for matching evaluation, even in the absence of population structure for general Y-STR diversity. Furthermore, the mismatch distribution can be used to measure the distance between a particular haplotype and all the haplotypes in a sample. When applied to the most frequent haplotypes in Europe, it revealed that the opportunity for IBS is not directly related to the frequency of a haplotype but is highly dependent on the proportion of neighbouring haplotypes, so reporting only the haplotype frequency when evaluating the significance of a match can be misleading.

© 2002 Published by Elsevier Science Ireland Ltd.

Keywords: Mutation recurrence; Identity-by-state; Identity-by-descent; Mismatch distribution; Match

1. Introduction

Y-chromosome haplotype matching is the main tool used in forensics to evaluate the significance of observing a particular haplotype in a given sample. To provide a large European reference set, the "Y-STR Haplotype Reference Database" was created, allowing the search for matches at a continental scale [1]. Matching at a continental scale relies on the assumption that there is no significant population structure in Europe for the Y-STRs [2].

At present, the evaluation of a match is qualitative, i.e. based only on the dichotomy equal/not-equal haplotypes, and the significance of a match/non-match has not been quantified. This quantification requires: (1) estimation of the haplotype frequency, and (2) evaluation of recurrence (i.e. generation of identical haplotypes by mutation rather than by shared ancestry).

With respect to the estimation of haplotype frequency, a Bayesian approach has recently been developed and implemented in the Y-STR database website [2,3]. It overcomes (a) the impossibility of applying the "product rule" of allele frequencies to linked markers and (b) the simple counting of haplotype occurrences in a database (which would require very large databases, given the high number of unique haplotypes). However, it cannot be overlooked that this is just the "best guess", and a cautious use of it has been suggested [3].

Forensic Science International 130 (2002) 147–155

Abbreviations: IBD, identical-by-descent; IBS, identical-by-state; Y-BMs, Y chromosome biallelic markers; Y-STR, Y chromosome short tandem repeat

* Corresponding author. Tel.: +351-22-5570700; fax: +351-22-5570799. E-mail address: [email protected] (L. Pereira).

0379-0738/02/$ – see front matter © 2002 Published by Elsevier Science Ireland Ltd.

PII: S0379-0738(02)00371-7

Nevertheless, due to the high mutation rate of STRs, recurrence introduces an important bias through the creation of haplotypes that are identical by state (IBS) rather than by descent (IBD). The approach of De Knijff [4], which evaluates the proportion of these classes among identical haplotypes defined by Y-biallelic markers (Y-BMs), was the only one used so far. Its basic assumption is that the same Y-STR haplotype in two different Y-BM haplogroups points to IBS and not IBD. When a slow-mutation event occurs, it appears on a particular Y-STR background, and from that point on both sister lineages accumulate diversity in the fast-evolving polymorphisms in proportion to the time of divergence. So, the probability of haplotype sharing between both (a false identity by descent) becomes more and more negligible. Being multi-generational, this approach has serious limitations because of the mutational mechanism that originates STR diversity, which is generally accepted to fit some variant of a basic step-wise model in which each allelic state most often gives rise to new ones that are one repeat (+/-) away. So, it overlooks re-recurrence phenomena, and it does not take into account intra-haplogroup homoplasy. De Knijff [4] used the combined Y-SNP/Y-STR approach to estimate the proportion of IBS, and obtained values of 0.8% when studying 6 Y-STRs and 4 Y-BMs and 0.2% when raising the number of Y-STRs to 8. Unfortunately, combined information for both Y-STRs and Y-BMs in the same sample is very rare, and it has therefore been impossible to measure the proportion of IBS at a large scale by this approach.

We suggest an approach based upon mismatch distributions/haplotype pairwise comparison that evaluates the proportion of Y-STR haplotypes on which mutation can originate IBS in the next generation, and we report its distribution at a European scale.

2. Material and methods

2.1. Sample

A total of 1477 different Y-STR haplotypes in 5529 individuals were collected (by 11 April 2001) from the Y-STR Haplotype Reference Database (http://ystr.charite.de) for the populations displayed in Table 1 (the Zeeland sample was not used in this study due to its small size and reduced diversity). The Y-STRs considered for analysis were DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392 and DYS393, and this order will be maintained below. In our calculations, the DYS389II number of repeats was recorded as the difference between the overall repeat number of the amplicon (as registered in the database) and the number of repeats at DYS389I. We exemplify the calculations with the population that we call The Netherlands (n = 229), which corresponds to the sum of the samples Holland (n = 87), Friesland (n = 44), Groningen (n = 48) and Limburg (n = 50) deposited in that database.
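The DYS389II recoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `recode_dys389ii` and the database-style entry (with DYS389II = 29 for the whole amplicon) are hypothetical and are not taken from the database itself.

```python
# Locus order used throughout the text.
LOCI = ("DYS19", "DYS389I", "DYS389II", "DYS390", "DYS391", "DYS392", "DYS393")


def recode_dys389ii(database_haplotype):
    """Return the haplotype with DYS389II stored as (amplicon total - DYS389I)."""
    values = dict(zip(LOCI, database_haplotype))
    values["DYS389II"] = values["DYS389II"] - values["DYS389I"]
    return tuple(values[locus] for locus in LOCI)


# Hypothetical database-style entry: DYS389II = 29 counts the whole amplicon.
database_entry = (14, 13, 29, 24, 11, 13, 13)
recoded = recode_dys389ii(database_entry)
print("-".join(str(v) for v in recoded))  # 14-13-16-24-11-13-13
```

Under this assumption the recoded entry has the same form as the haplotypes quoted in the text, where the third value is the DYS389II difference.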

2.2. Statistical analyses

Mismatch distributions were obtained using the ARLEQUIN 2.0 population genetics software [5]. Two types of mismatch distribution can be obtained, considering either just the number of loci at which two haplotypes carry different allelic states (irrespective of the number of repeat differences) or the number of repeat units by which two haplotypes differ over all loci under consideration. In practice, this can be done using two types of input file, respectively: (1) the "Microsat" input file, and (2) the "DNA" input file, where each repeat unit was coded as an ambiguous position (N) and differences in the number of repeat units were coded as insertions (N) or deletions (-).
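The two kinds of pairwise difference can also be illustrated outside ARLEQUIN. The Python sketch below is not the ARLEQUIN procedure; it is an independent toy implementation, and the function names and the four-individual sample are assumptions made only for illustration.

```python
from collections import Counter
from itertools import combinations


def loci_distance(h1, h2):
    """Number of loci at which the two haplotypes carry different alleles."""
    return sum(a != b for a, b in zip(h1, h2))


def repeat_distance(h1, h2):
    """Total number of repeat units separating the two haplotypes."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def mismatch_distribution(sample, distance):
    """Relative frequency of each distance class over all pairs of individuals."""
    classes = Counter(distance(h1, h2) for h1, h2 in combinations(sample, 2))
    n_pairs = sum(classes.values())
    return {k: classes[k] / n_pairs for k in sorted(classes)}


# Toy sample: one haplotype tuple per individual (not real data).
toy_sample = [
    (14, 13, 16, 24, 11, 13, 13),
    (14, 13, 16, 24, 11, 13, 13),
    (14, 13, 16, 23, 11, 13, 13),
    (14, 12, 16, 22, 10, 11, 13),
]
by_loci = mismatch_distribution(toy_sample, loci_distance)
by_repeats = mismatch_distribution(toy_sample, repeat_distance)
print(by_loci)
print(by_repeats)
```

In this toy sample the two distributions share classes 0 and 1 but diverge for the more distant pairs, which is the distinction the two input-file types are meant to capture.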

To calculate the distance in repeat units between a particular haplotype and all the haplotypes of a sample A, the mismatch distribution analysis can be applied by constructing an artificial sample B, in which the haplotype under analysis is added to the original sample A (either as another instance of a haplotype already present or as a new one). The distribution of the differences between that haplotype and all the haplotypes observed in the sample can then be obtained by subtracting the mismatch distribution of A from that of B, using the absolute numbers of pairwise comparisons rather than the relative frequencies.
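The subtraction of A from B can be checked on a toy example. The sketch below is an illustration under stated assumptions (a made-up sample of four individuals and hypothetical helper names); it compares the subtraction result with a direct comparison of the query haplotype against every individual in A.

```python
from collections import Counter
from itertools import combinations


def repeat_distance(h1, h2):
    """Total number of repeat units separating the two haplotypes."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def pair_counts(sample):
    """Absolute number of pairwise comparisons in each repeat-distance class."""
    return Counter(repeat_distance(h1, h2) for h1, h2 in combinations(sample, 2))


# Toy sample A (not real data) and the haplotype under analysis.
sample_a = [
    (14, 13, 16, 24, 11, 13, 13),
    (14, 13, 16, 24, 11, 13, 13),
    (14, 13, 16, 23, 11, 13, 13),
    (14, 12, 16, 22, 10, 11, 13),
]
query = (14, 13, 16, 24, 11, 13, 13)
sample_b = sample_a + [query]  # artificial sample B = A plus the query

# Pairs present in B but not in A are exactly the query-versus-A comparisons.
by_subtraction = pair_counts(sample_b) - pair_counts(sample_a)
direct = Counter(repeat_distance(query, h) for h in sample_a)

proportions = {k: by_subtraction[k] / len(sample_a) for k in sorted(by_subtraction)}
print(proportions)  # {0: 0.5, 1: 0.25, 6: 0.25}
```

Dividing the counts by the size of A turns them into the class 0, class 1, ... proportions for the query haplotype.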

3. Results and discussion

3.1. Mismatch distribution: estimation of the molecular distance between pairs of haplotypes

The evaluation of matches in the simple dichotomous framework of equal/not-equal haplotypes overlooks that both categories are heterogeneous. The first category includes both IBD and IBS haplotypes, while in the second the molecular distance between haplotypes is not taken into account, so the same weight is given to differences of one or several repeat units at one or several loci.

Mismatch distributions are histograms representing, for a set of sequences or haplotypes, the frequency of pairwise comparisons having 0, 1, ..., $n_{\max}$ differences, where $n_{\max}$ is the number of differences between the most different haplotypes. In population genetics they have been used as a way to represent graphically the amount of diversity in a sample. If the distribution is L-shaped, many sequences are identical and far fewer comparisons involve genetically different sequences. If it is bell-shaped, with mean k, many comparisons involve sequences that differ by roughly the same amount k. Moreover, it has been shown that the shape of mismatch distributions is highly influenced by ancient demographic events and selective sweeps [6,7]. In that way, a mismatch distribution is an easy way of obtaining the molecular distance between all pairs of haplotypes in a certain population. This strategy solves the difficulty of evaluating the molecular proximity between haplotypes: visually it is impossible to grasp the multidimensional network that connects all the haplotypes in a sample (for instance, placing each locus on an axis, it would be necessary to define as many dimensions as the number of loci considered for the haplotype, as pointed out by Krawczak, personal communication), which makes Y-STR networks impossible to interpret [8].

Table 1
Y-STR diversity and proportions (%) of the haplotype pool prone to become different-by-state (DBS) and identical-by-state (IBS) in Europe

Population            N     Haplotypes  Gene diversity ± SE  Class 0 (%)  DBS (%)  Class 1 (%)  IBS (%)
Northern Portugal     182   105         0.982 ± 0.004        1.85         0.082    6.70         0.020
Spain                 365   208         0.983 ± 0.003        1.69         0.075    5.77         0.018
Asturias              90    65          0.983 ± 0.007        1.72         0.076    6.37         0.019
Galicia               103   71          0.980 ± 0.007        1.96         0.087    5.41         0.016
Granada               52    43          0.989 ± 0.008        1.13         0.050    6.11         0.019
Zaragoza              120   86          0.984 ± 0.006        1.60         0.071    5.78         0.018
Italy                 553   342         0.991 ± 0.002        0.92         0.041    3.02         0.009
Modena                99    70          0.977 ± 0.009        2.27         0.101    6.56         0.020
NWItaly               131   98          0.983 ± 0.007        1.71         0.076    3.69         0.011
Rome                  216   161         0.995 ± 0.002        0.53         0.024    2.38         0.007
Toscany               107   87          0.993 ± 0.003        0.74         0.033    2.56         0.008
Innsbruck (Austria)   135   94          0.990 ± 0.003        0.98         0.043    3.45         0.010
Budapest (Hungary)    117   94          0.996 ± 0.002        0.43         0.019    2.18         0.007
Switzerland           199   141         0.989 ± 0.003        1.11         0.049    4.26         0.013
Bern                  91    68          0.990 ± 0.004        1.03         0.046    4.15         0.013
Lausanne              108   86          0.989 ± 0.004        1.09         0.048    4.29         0.013
Germany               2096  748         0.992 ± 0.001        0.83         0.037    3.18         0.010
Berlin                429   254         0.992 ± 0.001        0.80         0.036    2.95         0.009
Cologne               135   98          0.989 ± 0.004        1.07         0.047    3.57         0.011
Düsseldorf            150   98          0.984 ± 0.004        1.63         0.072    5.32         0.016
Freiburg              211   129         0.987 ± 0.003        1.35         0.060    4.84         0.015
Hamburg               114   101         0.997 ± 0.002        0.26         0.012    1.74         0.005
Leipzig               405   236         0.995 ± 0.001        0.49         0.022    2.07         0.006
Magdeburg             177   120         0.991 ± 0.002        0.91         0.040    3.27         0.010
Mainz                 104   75          0.990 ± 0.003        1.03         0.046    3.68         0.011
Munich                251   156         0.989 ± 0.002        1.14         0.051    3.74         0.011
Münster               58    42          0.981 ± 0.010        1.94         0.086    6.17         0.019
Stuttgart             61    50          0.991 ± 0.005        1.03         0.046    4.77         0.015
Belgium               97    70          0.985 ± 0.006        1.55         0.069    5.61         0.017
The Netherlands       229   123         0.980 ± 0.004        2.05         0.091    6.38         0.019
Friesland             44    34          0.985 ± 0.009        1.48         0.066    6.66         0.020
Groningen             48    30          0.963 ± 0.015        2.75         0.122    4.58         0.014
Holland               87    64          0.983 ± 0.006        1.74         0.077    5.91         0.018
Limburg               50    33          0.972 ± 0.011        2.78         0.123    7.92         0.024
Poland                596   203         0.985 ± 0.002        1.51         0.067    5.83         0.018
Bydgoszcz             168   103         0.987 ± 0.003        1.26         0.056    5.69         0.017
Northern Poland       150   92          0.985 ± 0.004        1.46         0.065    5.56         0.017
Warsaw                157   97          0.985 ± 0.004        1.52         0.067    5.19         0.016
Wroclaw               121   75          0.983 ± 0.005        1.75         0.078    7.34         0.022
Estonia               133   93          0.987 ± 0.004        1.31         0.058    3.68         0.011
Latvia                145   100         0.991 ± 0.003        0.93         0.041    4.53         0.014
Lithuania             151   100         0.988 ± 0.003        1.17         0.052    4.73         0.014
Moscow (Russia)       85    55          0.978 ± 0.008        2.24         0.099    6.44         0.020
Norway                300   164         0.986 ± 0.003        1.41         0.063    3.75         0.011
Norway Central        48    39          0.986 ± 0.009        1.42         0.063    3.63         0.011
Norway East           85    57          0.982 ± 0.006        1.82         0.081    5.66         0.017
Norway North          45    39          0.992 ± 0.007        0.81         0.036    3.54         0.011
Norway Oslo           33    26          0.983 ± 0.012        1.70         0.075    3.41         0.010
Norway South          25    23          0.990 ± 0.016        1.00         0.044    2.67         0.008
Norway West           64    53          0.991 ± 0.006        0.89         0.039    3.32         0.010
Buenos Aires          100   76          0.988 ± 0.005        1.23         0.055    2.87         0.009

Class 0: equal haplotypes; class 1: haplotypes differing in one repeat unit. DBS (%) and IBS (%): % of the pool prone to DBS and to IBS, respectively.

Fig. 1 shows both types of mismatch distribution for The Netherlands. It is noteworthy that 86.8% (6.38%/7.35%) of the pairs of haplotypes that differ at one locus differ by a single repeat unit.

In the observed mismatch distribution, two classes are important when considering the dichotomy equal/not-equal, namely: (a) class 0, the proportion of pairs of identical haplotypes, and (b) class 1, the proportion of pairs of haplotypes differing by a single step. Since mutation in STRs mainly affects one repeat unit at a time, it is upon these two classes that mutation will act in the next generation, creating haplotypes that are different-by-state (DBS) and IBS, respectively.

3.2. Estimation of IBS and DBS

The probability of a certain Y-STR haplotype becoming IBS to another one in the next generation arises from single-step mutation in pairs of haplotypes that differ by one repeat unit at one locus. Class 1 of the "DNA" mismatch distribution gives the proportion of pairs of haplotypes that differ by one repeat unit at one locus, and hence those that are prone to become IBS in the next generation.

The probability of a class 1 to class 0 transition (i.e. of two Y-STR haplotypes becoming IBS) per generation can be roughly estimated as $m(1-m)^{2n-1} f_1$, where $n$ is the number of loci defining the haplotype (seven in our case), $m$ the average mutation rate ($3.17 \times 10^{-3}$) [9], and $f_1$ the frequency of class 1. The rationale behind the formula is briefly the following. Inside class 1 there are $n$ different subclasses, one for each of the cases: haplotypes differing by just one repeat at the first locus, at the second, and so on. For simplicity's sake we will assume that all loci are equally diverse (but it can easily be demonstrated that this assumption is not essential), so that class 1 is divisible into $n$ equally frequent subclasses ($f_1/n$). Considering now the first locus only, for the haplotype pair there is a chance $m/2$ of the longer one losing one repeat (while no mutation occurs in the other member of the pair, an event with probability $1-m$) and the same probability of the shorter one gaining it (not considering the slight bias that seems to favour gains [9]). Therefore, the overall probability can be estimated as $(m/2)(1-m) + (1-m)(m/2) = m(1-m)$. The transition probability for just one locus will thus be given by $m(1-m) f_1/n$. Analysing now two loci, there are two subclasses to consider: pairs differing at the first (but not at the second) locus and vice versa, each with frequency $f_1/n$. Let us calculate the probability for the first kind of pair; the only events that lead to IBS require that mutation occurs (in the right direction, as above) in just one of the four alleles now involved (two in each haplotype), so $m(1-m)^3$. Adding the other kind of pair, the transition probability for two loci is $2m(1-m)^3 f_1/n$. Generalising to $n$ loci, it becomes $n\,m(1-m)^{2n-1} f_1/n = m(1-m)^{2n-1} f_1$. For The Netherlands the corresponding value is 0.0194%.
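As a quick numerical check of the formula, the following Python sketch evaluates it with the class 1 percentages listed in Table 1. The function name `ibs_proportion` is hypothetical; the mutation rate and the number of loci are the values quoted in the text.

```python
def ibs_proportion(f1_percent, n_loci=7, mutation_rate=3.17e-3):
    """Percentage of the pool prone to become IBS in the next generation.

    Implements m * (1 - m)**(2n - 1) * f1, with f1 given in percent.
    """
    return mutation_rate * (1 - mutation_rate) ** (2 * n_loci - 1) * f1_percent


netherlands_ibs = ibs_proportion(6.38)  # class 1 for The Netherlands (Table 1)
limburg_ibs = ibs_proportion(7.92)      # class 1 for Limburg (Table 1)
print(f"The Netherlands: {netherlands_ibs:.4f}%")  # The Netherlands: 0.0194%
print(f"Limburg: {limburg_ibs:.3f}%")              # Limburg: 0.024%
```

Both values agree with the last column of Table 1, which suggests that the table was obtained by applying the same expression to each population.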

There are more complex and rarer phenomena contributing to IBS, but they are comparatively insignificant. For instance, the probability of IBS resulting from mutations involving two repeat units at one locus would be $m_2(1-m_2)^{2n-1} f_{1i}$, where $m_2$ is the average mutation rate for two repeat units at one locus ($3.17 \times 10^{-4}$, since this kind of mutation is roughly 10 times less frequent than the single-step one) [9], and $f_{1i}$ the frequency of pairs of haplotypes differing by two steps at one locus. An overestimate of $f_{1i}$ is the difference between the values observed for class 1 in the two mismatch distributions, 0.97%, which in fact corresponds to the proportion of pairs differing by more than one repeat unit at one locus. This gives a value of 0.000306%, which is roughly 60 times lower than the probability calculated above.
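The two-step figure can be reproduced in the same way. This is a small arithmetic sketch with hypothetical variable names; it uses 7.35% and 6.38% (the two class 1 values for The Netherlands) to form the upper bound on the frequency of pairs two steps apart at one locus.

```python
N_LOCI = 7
SINGLE_STEP_RATE = 3.17e-3  # average single-step mutation rate [9]
TWO_STEP_RATE = 3.17e-4     # taken as one tenth of the single-step rate

# Upper bound on the frequency (%) of pairs differing by two steps at one locus:
# class 1 of the loci-based distribution minus class 1 of the repeat-based one.
f_two_step_percent = 7.35 - 6.38

two_step_ibs = TWO_STEP_RATE * (1 - TWO_STEP_RATE) ** (2 * N_LOCI - 1) * f_two_step_percent
single_step_ibs = SINGLE_STEP_RATE * (1 - SINGLE_STEP_RATE) ** (2 * N_LOCI - 1) * 6.38
ratio = single_step_ibs / two_step_ibs

print(f"two-step IBS: {two_step_ibs:.6f}%")  # two-step IBS: 0.000306%
print(f"single-step / two-step ratio: {ratio:.0f}")
```

The ratio confirms that the two-step contribution is smaller than the single-step one by well over an order of magnitude.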

Fig. 1. Mismatch distributions for the number of loci (A) and number of repeat units (B) pairwise differences for Y-STRs in The Netherlands sample.

The probability of becoming DBS can be approximated by the formula $2nmf_0$, where $f_0$ is the frequency of class 0. Indeed, when comparing a pair of equal haplotypes, any mutation, either losing or gaining repeat units, at any locus, will make the pair different. On the other hand, the probability of the same mutation occurring in both haplotypes ($m^2$) is nearly zero. The corresponding value for The Netherlands would be 0.0910%.
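The DBS approximation can be checked numerically as well. The sketch below uses a hypothetical function name, `dbs_proportion`, and the class 0 percentages from Table 1.

```python
def dbs_proportion(f0_percent, n_loci=7, mutation_rate=3.17e-3):
    """Percentage of the pool prone to become DBS: 2 * n * m * f0 (f0 in percent)."""
    return 2 * n_loci * mutation_rate * f0_percent


netherlands_dbs = dbs_proportion(2.05)  # class 0 for The Netherlands (Table 1)
hamburg_dbs = dbs_proportion(0.26)      # class 0 for Hamburg (Table 1)
print(f"The Netherlands: {netherlands_dbs:.4f}%")  # The Netherlands: 0.0910%
print(f"Hamburg: {hamburg_dbs:.3f}%")              # Hamburg: 0.012%
```

The factor 2n counts the alleles in the pair (n loci in each of the two haplotypes), any of which can mutate and break the identity.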

3.3. Estimation of IBS, in the next generation, at a European scale

The results for the proportion of the Y-STR pool that is prone to become IBS in the next generation, at a European scale, are presented in Table 1.

There is an apparent westward and northward gradient of increasing propensity for IBS in the next generation, from a focus in southern-central Europe (Fig. 2). Hungary and Italy display the lowest values, around one half of those observed in the more extreme western and northern regions.

These gradients are concordant, as expected, with the distribution of gene diversity values, not only for Y-STRs (Table 1) but also for Y-BMs (Fig. 2), described for a similar area by Rosser et al. [10]. Southern-central Europe is located in an area of convergence of differently oriented Y-BM clines, from the southeast into the northwest [11]. So, a lower probability of haplotypes becoming IBS within haplogroups is expected, since the number of individuals inside each haplogroup is also low; in contrast, the propensity for IBS between haplogroups is predictably higher, since there is a larger number of haplogroups.

3.4. Mismatch distribution as a tool to measure the distance of a particular haplotype to the database

As already pointed out, there is a great bias when analysing a match in the classical fashion, by checking whether there is a match in the database and, if so, reporting the haplotype frequency. This method disregards the fact that the particular haplotype may be molecularly close to one or several other haplotypes, and hence prone to become IBS or DBS by mutation. By mismatch distribution analysis, it is possible to calculate the molecular distance between one particular haplotype and all the remaining haplotypes in a certain population.

Fig. 2. Values of Y-STR IBS (in %) and Y-BM diversity (italic) at a European scale.

Fig. 3. Mismatch distribution for the differences in repeat units between each of the Y-STR haplotypes belonging to cluster 1 (A) and to cluster 2 (B), respectively, and all the haplotypes in Europe.

We applied this strategy to the European database. Given the mutational mechanism generating STR diversity, it is expected that the haplotype with the most molecularly close haplotypes will be the most frequent one. So let us suppose that we are investigating a match for the haplotype 14-13-16-24-11-13-13 (E0451), which occurs in the European sample with a frequency of 5.14%. Calculating the molecular distance to all the haplotypes in the European sample (Fig. 3A and Table 2), we observe that a substantial proportion (11.03%) of haplotypes differ from it by only one step, which makes the probability of IBS or DBS very high. Let us now look at the properties of its closest haplotypes, that is, all the haplotypes that are one-step neighbours to it (which will be called cluster 1 and whose overall frequency, E0451 included, reaches 16.17%): all show a high frequency for class 1, and so a significant propensity for IBS. We extended the same analysis to the one-step neighbours of haplotypes 14-13-16-23-11-13-13 (E0414) and 14-13-16-24-10-13-13 (E0435), which are respectively the second and third most frequent European haplotypes and are themselves one-step neighbours of E0451 (those clusters will be called cluster 1.a and 1.b, respectively). Summing clusters 1, 1.a and 1.b, a frequency of 23.57% is obtained (Table 2). This is undoubtedly the main reason for the first peak in the bimodal distribution displayed by almost all the haplotypes of cluster 1 (Fig. 3A) and of clusters 1.a and 1.b (not shown), while the second peak represents the pairwise comparisons with the remaining haplotypes in the sample.

Besides those haplotypes, the fourth most frequent one is 14-12-16-22-10-11-13 (E0216; frequency of 2.30% in Europe), which is six steps away from E0451 (Table 3). Together with its one-step neighbours (cluster 2), it accounts for only 6.20% of the European sample. The main point is that, when comparing haplotypes of comparable frequency from cluster 1 and cluster 2, the proportion of class 1 is very different between them (e.g. E0435 of cluster 1, with a frequency of 2.44%, has 9.75% for class 1, while E0216 of cluster 2, with a frequency of 2.30%, has 3.91% for class 1). This shows how biased it is to report only the frequency of a particular haplotype, disregarding the proportion of neighbouring haplotypes in the sample.
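The class 0 and class 1 values of a given haplotype can be recomputed directly from haplotype counts. The Python sketch below is an illustration with hypothetical function names; the counts are those listed for E0451 and its one-step neighbours in Table 2, and the total of 5529 individuals is the sample size given in Section 2.1.

```python
# Counts (N) of E0451 and its 14 one-step neighbours, copied from Table 2.
TABLE2_CLUSTER1 = {
    "14-13-16-24-11-13-13": 284,  # E0451
    "14-13-16-23-11-13-13": 149,  # E0414
    "14-13-16-24-10-13-13": 135,  # E0435
    "14-14-16-24-11-13-13": 64,
    "14-13-17-24-11-13-13": 50,
    "14-13-16-25-11-13-13": 46,
    "15-13-16-24-11-13-13": 35,
    "14-13-16-24-11-13-12": 32,
    "14-12-16-24-11-13-13": 19,
    "14-13-15-24-11-13-13": 16,
    "14-13-16-24-11-13-14": 16,
    "14-13-16-24-11-12-13": 16,
    "14-13-16-24-12-13-13": 12,
    "14-13-16-24-11-14-13": 11,
    "13-13-16-24-11-13-13": 9,
}
TOTAL_INDIVIDUALS = 5529  # size of the European sample


def parse(text):
    """Turn '14-13-16-...' into a tuple of repeat numbers."""
    return tuple(int(x) for x in text.split("-"))


def one_step_neighbours(haplotype):
    """All haplotypes reachable by gaining or losing one repeat at one locus."""
    return [
        haplotype[:i] + (allele + step,) + haplotype[i + 1:]
        for i, allele in enumerate(haplotype)
        for step in (-1, 1)
    ]


def class0_class1(haplotype, counts, total):
    """Percentages of the sample equal to, or one step away from, `haplotype`."""
    class0 = 100 * counts.get(haplotype, 0) / total
    class1 = 100 * sum(counts.get(h, 0) for h in one_step_neighbours(haplotype)) / total
    return class0, class1


counts = {parse(key): value for key, value in TABLE2_CLUSTER1.items()}
e0451 = parse("14-13-16-24-11-13-13")
class0, class1 = class0_class1(e0451, counts, TOTAL_INDIVIDUALS)
print(f"class 0 = {class0:.2f}%, class 1 = {class1:.2f}%")  # class 0 = 5.14%, class 1 = 11.03%
```

Only the one-step neighbours enter class 1, so a haplotype of similar frequency but with sparsely populated neighbours (such as E0216) yields a much lower value.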

Notice that the mismatch distributions for clusters 2 and 2.a (haplotype E0230 and its one-step neighbours) are unimodal (Fig. 3B; not shown for cluster 2.a). This pattern is similar to the one displayed by the overall haplotype pairwise comparison in the European sample (Fig. 4). It seems that the bimodal shape displayed by the main class of haplotypes in Europe resembles the bimodal pattern displayed by the Y-chromosome biallelic markers [11], but the overall Y-STR European sample is already unimodal, probably due to the high mutation rate of STRs.

Although, as stressed before, the overall Y-STR diversity is not structured in Europe [2], once again there seems to be a non-random geographical pattern in the distribution of cluster 1 in Europe (Fig. 5A). It is clearly less frequent in the eastern portion of the continent and seems to increase in frequency from south-central Europe towards the west and north, resembling the pattern just described for the propensity to IBS. Cluster 2 (Fig. 5B) is more frequent in north-central Europe and decreases towards the west and east. The comparison of the diversities of the two clusters suggests that cluster 1 is more ancient than cluster 2.

Table 2
Frequencies of a particular Y-STR haplotype (class 0) and its one-step neighbours (class 1) inside clusters 1, 1.a and 1.b in the European sample

Haplotype                    N    Class 0 (%)  Class 1 (%)
E0451 14-13-16-24-11-13-13   284  5.14         11.03
E0414 14-13-16-23-11-13-13   149  2.69         9.26
E0435 14-13-16-24-10-13-13   135  2.44         9.75
E0640 14-14-16-24-11-13-13   64   1.16         7.22
E0530 14-13-17-24-11-13-13   50   0.90         7.16
E0475 14-13-16-25-11-13-13   46   0.83         6.53
E0908 15-13-16-24-11-13-13   35   0.63         6.85
E0450 14-13-16-24-11-13-12   32   0.58         6.26
E0256 14-12-16-24-11-13-13   19   0.34         6.02
E0362 14-13-15-24-11-13-13   16   0.29         6.37
E0452 14-13-16-24-11-13-14   16   0.29         5.81
E0446 14-13-16-24-11-12-13   16   0.29         5.77
E0460 14-13-16-24-12-13-13   12   0.21         5.53
E0454 14-13-16-24-11-14-13   11   0.20         5.91
E0069 13-13-16-24-11-13-13   9    0.16         5.39
Cluster 1: N = 894, 16.17%

E0403 14-13-16-23-10-13-13   74   1.34         6.53
E0621 14-14-16-23-11-13-13   33   0.60         4.43
E0349 14-13-15-23-11-13-13   26   0.47         3.64
E0508 14-13-17-23-11-13-13   19   0.34         4.58
E0883 15-13-16-23-11-13-13   17   0.31         3.73
E0416 14-13-16-23-11-14-13   13   0.24         3.18
E0241 14-12-16-23-11-13-13   10   0.18         3.42
E0420 14-13-16-23-12-13-13   9    0.16         3.09
E0411 14-13-16-23-11-12-13   8    0.14         3.18
E0413 14-13-16-23-11-13-12   7    0.13         3.60
E0389 14-13-16-22-11-13-13   7    0.13         3.07
E0415 14-13-16-23-11-13-14   4    0.07         3.36
E0061 13-13-16-23-11-13-13   1    0.02         2.95
Cluster 1.a (without E0451 and E0414): N = 228, 4.12%

E0521 14-13-17-24-10-13-13   30   0.54         4.32
E0468 14-13-16-25-10-13-13   27   0.49         3.85
E0901 15-13-16-24-10-13-13   25   0.45         3.60
E0632 14-14-16-24-10-13-13   21   0.38         4.27
E0434 14-13-16-24-10-13-12   17   0.31         3.53
E0357 14-13-15-24-10-13-13   15   0.27         3.27
E0250 14-12-16-24-10-13-13   13   0.24         3.17
E0439 14-13-16-24-10-14-13   11   0.20         3.04
E0436 14-13-16-24-10-13-14   10   0.18         3.00
E0432 14-13-16-24-10-12-13   8    0.14         3.07
E0067 13-13-16-24-10-13-13   3    0.05         2.82
E0424 14-13-16-24-11-13-13   1    0.02         2.51
Cluster 1.b (without E0435, E0451, E0414 and E0403): N = 181, 3.27%

Total: N = 1303, 23.57%

For the definition of the clusters see text.

4. Final considerations

STRs have been the markers of choice in forensics due to their high polymorphism. SNPs, although highly stable, are much less polymorphic and therefore less informative in the forensic field, at least while more advanced technologies allowing their easy and fast typing are not available. So, the extensive forensic databases available for matching are now (and will remain for some years) exclusively based on STRs.

It seems, therefore, that the approach presented here can be useful in the forensic field for evaluating the significance and evidential value of Y-STR haplotype matches. This is best illustrated by comparing the statistics of haplotypes that differ in their abundance of neighbouring haplotypes. As an example we take two non-observed haplotypes, 14-12-16-22-09-11-13 and 19-15-15-25-10-11-13, which were chosen because they lie, respectively, next to cluster 2 and more than one step away from any observed haplotype. While their estimated frequencies as trace haplotypes, by the surveying method, are $3.0 \times 10^{-4}$ and $1.4 \times 10^{-4}$, respectively, their probabilities of occurrence in the next generation by one-step mutation are 2.33 and 0%, respectively.

This approach to IBS evaluation rests on different assumptions from the one based on joint Y-STR/Y-BM information. The first is limited to one generation, while the second is multigenerational; however, while the second has to assume that IBS occurs only between haplogroups, the first considers IBS both within and between haplogroups, although it cannot distinguish them (unless we also have access to Y-BM information). The comparison between the two values obtained for the Netherlands population (0.0194% for the 7 Y-STR approach and 0.2% for the 8 Y-STR/4 Y-BM approach) shows that roughly 1/10 of the observed IBS is attributable to one-generation events. This finding is particularly important in the forensic field, where we are interested in recent ancestry (kinship 0 in criminalistics or 1 in paternity), and where pairs of haplotypes within haplogroups will be more similar, and so more prone to IBS, than pairs of haplotypes between haplogroups.
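As a quick check of the "roughly 1/10" figure, the ratio of the two Netherlands estimates can be written out explicitly. The subscripted symbols are introduced here for illustration only, and since the two estimates rest on different marker sets the ratio is indicative rather than exact.

```latex
\frac{\mathrm{IBS}_{\text{one-generation}}}{\mathrm{IBS}_{\text{multigenerational}}}
  = \frac{0.0194\%}{0.2\%}
  = 0.097 \approx \frac{1}{10}
```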

Table 3
Frequencies of a particular Y-STR haplotype (Class 0) and its one-step neighbours (Class 1) inside clusters 2 and 2.a in the European sample

Code    Haplotype               N     Class 0 (%)   Class 1 (%)
E0216   14-12-16-22-10-11-13    127   2.30          3.91
E0230   14-12-16-23-10-11-13    94    1.70          3.67
E0730   15-12-16-22-10-11-13    31    0.56          3.78
E0217   14-12-16-22-10-11-14    21    0.38          2.98
E0381   14-13-16-22-10-11-13    18    0.33          2.95
E0272   14-12-17-22-10-11-13    17    0.31          3.69
E0223   14-12-16-22-11-11-13    10    0.18          2.66
E0215   14-12-16-22-10-11-12    8     0.14          2.60
E0195   14-12-15-22-10-11-13    5     0.09          2.41
E0207   14-12-16-21-10-11-13    4     0.07          2.39
E0189   14-11-16-22-10-11-13    3     0.05          2.32
E0219   14-12-16-22-10-12-13    2     0.04          2.53
E0212   14-12-16-22-10-10-13    2     0.04          2.35
E0007   13-12-16-22-10-11-13    1     0.02          2.35
NO1     14-12-16-22-09-11-13    0     0             2.33
Cluster 2: N = 343, 6.20%
E0280   14-12-17-23-10-11-13    24    0.43          2.56
E0396   14-13-16-23-10-11-13    14    0.25          3.07
E0741   15-12-16-23-10-11-13    13    0.24          2.75
E0229   14-12-16-23-10-11-12    7     0.13          2.41
E0246   14-12-16-24-10-11-13    7     0.13          2.08
E0231   14-12-16-23-10-11-14    4     0.07          2.24
E0237   14-12-16-23-11-11-13    3     0.05          2.08
E0233   14-12-16-23-10-12-13    1     0.02          1.95
E0199   14-12-15-23-10-11-13    1     0.02          1.92
E0190   14-11-16-23-10-11-13    1     0.02          1.77
E0228   14-12-16-23-09-11-13    1     0.02          1.75
NO2     13-12-16-23-10-11-13    0     0             1.81
NO3     14-12-16-23-10-10-13    0     0             1.74
Cluster 2.a (without E0230 and E0216): N = 76, 1.37%
Total: 419, 7.57%
NO: not observed.

Fig. 4. Mismatch distribution for the differences in repeat units for the Y-STR haplotypes in Europe.

A main issue is the use of large-scale databases for broad matching surveys, whose legitimacy rests on the claim that there is no significant population substructuring for STRs in Europe (contrary to SNPs). Here, we demonstrate that this claim is at least debatable, since the opportunity for IBS varies considerably across Europe, as judged from the mismatch distributions, and in a way that is congruent with the well-defined gradients of SNP haplogroups: higher SNP diversity in south-central Europe is associated with a lower proportion of haplotype pairs prone to IBS, and decreasing SNP diversity towards the west and north is associated with a higher risk of recurrence. This is also in agreement with the findings of Chikhi et al. [12], who showed the existence of significant clines across Europe for microsatellites and minisatellites. So, even if there is no population structure for Y-STR diversity at the European scale, the same cannot be said of the propensity for IBS.
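The proportion of haplotype pairs that lie one step apart can be read from a pairwise mismatch distribution. Below is a minimal sketch, assuming that the distance between two haplotypes is the sum over loci of the absolute differences in repeat units (one possible reading of "differences in repeat units"). The sample is a small hypothetical one, not the European data, and no weighting by mutation rate is applied.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Tuple

Haplotype = Tuple[int, ...]


def distance(a: Haplotype, b: Haplotype) -> int:
    """Sum of absolute repeat-unit differences across loci (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pairwise_mismatch(counts: Dict[Haplotype, int]) -> Counter:
    """Map each distance to the number of chromosome pairs at that distance."""
    dist: Counter = Counter()
    # Pairs of chromosomes carrying the same haplotype are at distance 0.
    for n in counts.values():
        dist[0] += n * (n - 1) // 2
    # Pairs carrying different haplotypes contribute n1 * n2 pairs each.
    for (h1, n1), (h2, n2) in combinations(counts.items(), 2):
        dist[distance(h1, h2)] += n1 * n2
    return dist


def proportion_at(dist: Counter, d: int) -> float:
    """Fraction of all pairs that lie at distance d."""
    total = sum(dist.values())
    return dist[d] / total if total else 0.0


# Hypothetical toy sample: haplotype -> number of chromosomes observed.
TOY_SAMPLE: Dict[Haplotype, int] = {
    (14, 12, 16, 22, 10, 11, 13): 3,
    (14, 12, 16, 23, 10, 11, 13): 2,
    (15, 13, 16, 22, 10, 11, 13): 1,
}

if __name__ == "__main__":
    mismatch = pairwise_mismatch(TOY_SAMPLE)
    print(sorted(mismatch.items()))
    print(proportion_at(mismatch, 1))
```

In the toy sample, 6 of the 15 possible pairs are one step apart, so the proportion at distance 1 is 0.4. Running the same tabulation separately for each population would allow the kind of geographical comparison described above.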

The use of the mismatch distribution to measure the molecular distance between a particular haplotype and all the haplotypes in a sample showed, as expected, that the propensity for IBS depends largely on the proportion of neighbouring haplotypes. So, evaluating a match only in terms of haplotype frequency can give a very biased estimate, although, of course, the cloud of related haplotypes is itself dependent on that frequency. A much more informative estimate could be given in the form of how frequently one-step neighbour haplotypes are present in the sample.
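The distance between one haplotype and every chromosome in a sample can be tabulated directly. This is a minimal sketch, again assuming that distance is the sum of absolute repeat-unit differences across loci. The helper names are hypothetical, and the sample is only a handful of rows taken from Table 3, not the full European data set.

```python
from collections import Counter
from typing import Dict, Tuple

Haplotype = Tuple[int, ...]


def parse(h: str) -> Haplotype:
    """Turn '14-12-16-22-10-11-13' into a tuple of repeat numbers."""
    return tuple(int(x) for x in h.split("-"))


def distance(a: Haplotype, b: Haplotype) -> int:
    """Sum of absolute repeat-unit differences across loci (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def focal_mismatch(focal: Haplotype, counts: Dict[Haplotype, int]) -> Counter:
    """Map each distance from the focal haplotype to a chromosome count."""
    dist: Counter = Counter()
    for h, n in counts.items():
        dist[distance(focal, h)] += n
    return dist


# A few rows of Table 3 (haplotype -> N); deliberately incomplete.
SUBSET: Dict[Haplotype, int] = {
    parse("14-12-16-22-10-11-13"): 127,  # E0216
    parse("14-12-16-23-10-11-13"): 94,   # E0230
    parse("15-12-16-22-10-11-13"): 31,   # E0730
    parse("14-12-17-23-10-11-13"): 24,   # E0280
    parse("14-12-16-24-10-11-13"): 7,    # E0246
}

if __name__ == "__main__":
    for label in ("14-12-16-22-10-11-13", "14-12-16-22-09-11-13"):
        print(label, sorted(focal_mismatch(parse(label), SUBSET).items()))
```

Bin 0 of the result corresponds to Class 0 and bin 1 to Class 1. Reporting bin 1 alongside bin 0 is one way to express how frequently one-step neighbour haplotypes are present in the sample.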

Fig. 5. Frequency distribution of the Y-STR haplotypes belonging to the clusters 1 (A) and 2 (B) in Europe.

Acknowledgements

This work was partially supported by a research grant (PRAXIS XXI BD/13632/97) from Fundação para a Ciência e a Tecnologia and IPATIMUP by Programa Operacional Ciência, Tecnologia e Inovação (POCTI), Quadro Comunitário de Apoio III. We wish to thank Mark Beaumont, Lounes Chikhi and Richard Nichols for comments on earlier versions of the manuscript.

References

[1] L. Roewer, M. Krawczak, S. Willuweit, M. Nagy, C. Alves, A. Amorim, K. Anslinger, C. Augustin, A. Betz, E. Bosch, A. Caglia, A. Carracedo, D. Corach, A.F. Dekairelle, T. Dobosz, B.M. Dupuy, S. Furedi, C. Gehrig, L. Gusmao, J. Henke, L. Henke, M. Hidding, C. Hohoff, B. Hoste, M.A. Jobling, H.J. Kargel, P. De Knijff, R. Lessig, E. Liebeherr, M. Lorente, B. Martinez-Jarreta, P. Nievas, M. Nowak, W. Parson, V.L. Pascali, G. Penacino, R. Ploski, B. Rolf, A. Sala, U. Schmidt, C. Schmitt, P.M. Schneider, R. Szibor, J. Teifel-Greding, M. Kayser, Online reference database of European Y-chromosomal short tandem repeat (STR) haplotypes, Forensic Sci. Int. 118 (2001) 106–113.

[2] L. Roewer, M. Kayser, P. De Knijff, K. Anslinger, A. Betz, A. Caglia, D. Corach, S. Furedi, L. Henke, M. Hidding, H.J. Kargel, R. Lessig, M. Nagy, V.L. Pascali, W. Parson, B. Rolf, C. Schmitt, R. Szibor, J. Teifel-Greding, M. Krawczak, A new method for the evaluation of matches in non-recombining genomes: application to Y-chromosomal short tandem repeat (STR) haplotypes in European males, Forensic Sci. Int. 114 (2000) 31–43.

[3] M. Krawczak, Forensic evaluation of Y-STR haplotype matches: a comment, Forensic Sci. Int. 118 (2001) 114–115.

[4] P. De Knijff, Y chromosome shared by descent or by state, in: C. Renfrew, K. Boyle (Eds.), Archaeogenetics: DNA and the Population Prehistory of Europe, McDonald Institute Monographs, Oxbow Books, Cambridge, 2000, pp. 301–304.

[5] S. Schneider, D. Roessli, L. Excoffier, Arlequin ver. 2.0: a software for population genetic data analysis, Genetics and Biometry Laboratory, University of Geneva, Switzerland, 2000.

[6] M. Slatkin, R.R. Hudson, Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations, Genetics 129 (1991) 555–562.

[7] A.R. Rogers, H. Harpending, Population growth makes waves in the distribution of pairwise genetic differences, Mol. Biol. Evol. 9 (1992) 552–569.

[8] L. Roewer, M. Kayser, P. Dieltjes, M. Nagy, E. Bakker, M. Krawczak, P. De Knijff, Analysis of molecular variance (AMOVA) of Y-chromosome-specific microsatellites in two closely related human populations, Hum. Mol. Genet. 5 (1997) 1029–1033.

[9] M. Kayser, L. Roewer, M. Hedman, L. Henke, J. Henke, S. Brauer, C. Kruger, M. Krawczak, M. Nagy, T. Dobosz, R. Szibor, P. De Knijff, M. Stoneking, A. Sajantila, Characteristics and frequency of germline mutations at microsatellite loci from the human Y chromosome, as revealed by direct observation in father/son pairs, Am. J. Hum. Genet. 66 (2000) 1580–1588.

[10] Z.H. Rosser, T. Zerjal, M.E. Hurles, M.A. Adojaan, D. Alavantic, A. Amorim, W. Amos, M. Armenteros, E. Arroyo, G. Barbujani, L. Beckman, J. Bertranpetit, E. Bosch, D.G. Bradley, G. Brede, G.C. Cooper, H.B.S.M. Corte-Real, P. De Knijff, R. Decorte, Y.E. Dubrova, O. Evgrafov, A. Gilissen, S. Glisic, M. Golge, E.W. Hill, A. Jeziorowska, L. Kalaydjieva, M. Kayser, S.A. Kravchenco, J. Lavinha, L.A. Livshits, S. Maria, K. McElreavey, T.A. Meitinger, B. Melegh, R.J. Mitchell, J. Nicholson, S. Nørby, A. Novelletto, A. Pandya, J. Parik, P.C. Patsalis, L. Pereira, B. Peterlin, G. Pielberg, M.J. Prata, C. Previdere, K. Rajczy, L. Roewer, S. Rootsi, D.C. Rubinsztein, J. Saillard, F.R. Santos, M. Shlumukova, G. Stefanescu, B.C. Sykes, A. Tolun, R. Villems, C. Tyler-Smith, M.A. Jobling, Y-chromosomal diversity within Europe is clinal and influenced primarily by geography rather than language, Am. J. Hum. Genet. 67 (2000) 1526–1543.

[11] L. Pereira, I. Dupanloup, Z.H. Rosser, M.A. Jobling, G. Barbujani, Y-chromosome mismatch distributions in Europe, Mol. Biol. Evol. 18 (2001) 1259–1271.

[12] L. Chikhi, G. Destro-Bisol, V. Pascali, V. Baravelli, M. Dobosz, G. Barbujani, Clinal variation in the DNA of Europeans, Hum. Biol. 70 (1998) 643–657.
