Estimating heritability using genomic data
John Stanton-Geddes1*†, Jeremy B. Yoder1, Roman Briskine2, Nevin D. Young1,3 and
Peter Tiffin1*
1Department of Plant Biology, University of Minnesota, Saint Paul, MN 55108, USA; 2Department of Computer Science and
Engineering, University of Minnesota, Minneapolis, MN 55455, USA; and 3Department of Plant Pathology, University of
Minnesota, Saint Paul, MN 55108, USA
Summary
1. Heritability (h2) represents the potential for short-term response of a quantitative trait to selection. Unfortu-
nately, estimating h2 through traditional crossing experiments is not practical for many species, and even for
those in which mating can be manipulated, it may not be possible to assay them in ecologically relevant environ-
ments.
2. We evaluated an approach, GCTA, that uses relatedness estimated from genomic data to estimate the
proportion of phenotypic variance due to genotyped SNPs, which can be used to infer h2. Using phenotypic and
genotypic data from eight replicates of experimentally grown plants of the annual legume Medicago truncatula, we
examined how h2 estimates from GCTA (h2GCTA) related to traditional estimates of heritability (clonal
repeatability for these inbred lines). Further, we examined how h2GCTA estimates were affected by SNP number, minor
allele frequency, the number of individuals assayed and the exclusion of causative SNPs.
3. We found that the average h2GCTA estimates for each trait made with the full data set (>5 million SNPs, 200
individuals) were strongly correlated (r = 0.99) with estimates of clonal repeatability. However, this result
masks considerable variation among replicate estimates of h2GCTA, even in relatively uniform greenhouse
conditions. h2GCTA estimates with 250 000 and 25 000 SNPs were very similar to those obtained with >5 million
SNPs, but with 2500 SNPs, h2GCTA estimates were lower and had higher variance than those with ≥25 k SNPs. h2GCTA
estimates were slightly lower when only common SNPs were used. Excluding putatively causative SNPs had little
effect on the estimates of h2GCTA, suggesting that genotyping putatively causative SNPs is not necessary to obtain
accurate estimates of h2. The number of accessions sampled had the greatest effect on h2GCTA estimates, and
variance greatly increased as fewer accessions were included. With only 50 accessions sampled, h2GCTA estimates
ranged from 0 to 1 for all traits.
4. These results indicate that the GCTA method may be useful for estimating h2 using data sets of a size that are
available from reduced-representation genotyping but that hundreds of individuals may need to be sampled to
obtain robust estimates of h2.
Key-words: Genomic selection, GWAS, heritability, quantitative genetics
Introduction
The vast majority of ecologically important traits are quantita-
tive or complex traits. For these traits, the genetic basis is too
complex to be untangled using traditional molecular genetic
approaches, although genome-wide association mapping is
being used to identify individual genes that contribute to the
variation. Given the inability to map phenotypic variation of
complex traits to the individual molecular determinants, statis-
tical approaches based on relatedness among individuals have
been used to disentangle the relative contributions of genetics
and environment to phenotypic variation (Fisher 1918;
Falconer & Mackay 1996) and to predict expected evolutionary
responses to selection (Lande & Arnold 1983; Shaw et al.
2008). The effectiveness of formal statistical approaches for
predicting response to selection without explicitly identifying
the molecular mechanisms is demonstrated by the advances
achieved by plant and animal breeders in the past 100 years
(Moose, Dudley & Rocheford 2004).
A central parameter in estimating responses to selection and
summarizing the proportion of variance due to genetics is heritability
(h2) (Wright 1920; Falconer & Mackay 1996; Lynch &
Walsh 1998; Hill 2010). Traditionally, h2 has been estimated
through pedigree analyses or using individuals of known rela-
tionships established by experimental crosses. Although rea-
sonable for short-lived species that can be reared in common
environments, experimental crosses are considerably more
challenging for species that are long lived, very large or are not
amenable to controlled mating. In addition, because heritabil-
ity is dependent upon the environment in which organisms are
reared, h2 estimates made under controlled conditions may not
be good estimators of h2 in natural conditions (Geber & Griffen
2003), although a survey of estimates from animals suggests
that laboratory and field estimates are often quite similar
(Weigensberg & Roff 1996).
*Correspondence author. E-mails: [email protected]; ptiffin@umn.edu
†Current address: Department of Biology, University of Vermont, 102
Marsh Life Sciences, 109 Carrigan Dr, Burlington, VT 05405, USA
© 2013 The Authors. Methods in Ecology and Evolution © 2013 British Ecological Society
Methods in Ecology and Evolution 2013, 4, 1151–1158 doi: 10.1111/2041-210X.12129
Recognizing these limitations, Ritland and colleagues
(Ritland 1996; Ritland & Ritland 1996; Lynch & Ritland
1999) developed methods for estimating h2 in the field. These
methods were based on linear relationships between marker-
based estimates of relatedness and phenotypes. However,
because of uncertainty in estimates of relatedness and confounding
of relatedness with the environment, h2 estimates
from these methods have not been accurate (Coltman 2005;
Frentiu et al. 2008; Pemberton 2008; Gay, Siol & Ronfort
2013). In recent years, multiple methods have been developed
in the animal breeding literature that use large-scale genomic
data to predict phenotypes (Meuwissen, Hayes & Goddard
2001; Van Raden 2008; Goddard et al. 2009; Campos et al.
2012) and estimate heritability based on the proportion of
phenotypic variance explained by genotyped SNPs. Many of
these methods have focused on phenotype prediction (i.e.
genomic selection or genomic prediction) with the goal of
being able to predict phenotypes in order to speed up breed-
ing cycles and save money required for phenotyping (God-
dard & Hayes 2009; Jannink, Lorenz & Iwata 2010). Yang
et al. (2010, 2011a) modified these methods in the GCTA
software package to estimate the additive genetic variance for
a trait using genome-scale single nucleotide polymorphism
(SNP) data. This method advances Ritland (1996) by estimat-
ing relatedness with many thousands of markers and then
using estimated relatedness to infer the additive genetic vari-
ance (hereafter referred to as h2GCTA) of a trait. If the assayed
SNPs adequately capture the relationships among individuals
at causative alleles, h2GCTA is equivalent to narrow-sense heritability
(Yang et al. 2010). Yang et al. (2010) evaluated their
method by estimating the proportion of variance explained
by ~290 k common SNPs in 3925 humans and, after correcting
for SNPs that are not genotyped and those at lower minor
allele frequency, came quite close to the heritability of height
estimated from sib models, ~0.8. GCTA also has been used to
estimate heritability for human weight (Yang et al. 2011b),
intelligence (Davies et al. 2011), disease susceptibility (Lee
et al. 2012) and personality (Verweij et al. 2012). In addition,
similar methods have been used to partition genetic variation
in wing length (Robinson et al. 2013), clutch size and egg
mass of a wild bird population (Santure et al. 2013).
Genomic-based estimates of heritability, such as that imple-
mented in GCTA, together with the ability to collect genome-
scale polymorphism data through reduced-representation
genotyping such as RAD-tag, multiplexed shotgun genotyping
or genotype-by-sequencing (GBS) (Baird et al. 2008;
Andolfatto et al. 2011; Elshire et al. 2011), can make precise
estimates of heritability practical even for natural populations
of long-lived non-model species. Such estimates may be valu-
able for understanding evolution in natural populations and
predicting population responses to environmental perturba-
tions including ongoing climate change (Lavergne et al. 2010;
Shaw & Etterson 2012).
In this study, our overall objective was to use full-genome
sequence data and empirical estimates of heritability to evalu-
ate the effects of marker density and minor allele frequency,
sample size and the exclusion of causative SNPs on the perfor-
mance of the GCTA method. Our specific objectives were to
(i) compare h2 estimates obtained from replicated accessions
grown in a common environment with the h2GCTA estimates of
heritability, (ii) evaluate how estimates of h2GCTA are affected
by sample size, SNP density and minor allele frequency and
(iii) examine the effects that excluding putatively causal genomic
regions has on h2GCTA estimates. We pursue these objectives
by subsampling a data set of ~6 million SNPs identified by genome
sequencing (Branca et al. 2011; Stanton-Geddes et al.
2013) and phenotypic data for six traits (flowering time, height,
trichome density, total nodules and rhizobia strain occupancy
of nodules above and below 5 cm root growth) from 226 acces-
sions of M. truncatula. Our evaluation of the method extends
that of Yang et al. (2010) by empirically examining the effect
of including uncommon SNPs and by evaluating the perfor-
mance of GCTA with SNP densities and sample sizes that may
be obtained by evolutionary ecologists working in non-model
systems.
Materials and methods
Medicago truncatula is highly selfing in nature (>95%; Bonnin et al.
2001; Siol et al. 2008) with a native range that extends around the
Mediterranean (Ronfort et al. 2006). The genomic and phenotypic data
analysed here, the collection of which is described in full in
Stanton-Geddes et al. (2013), were obtained from 226 accessions of
M. truncatula sampled from across a wide portion of the species range. In brief,
the genome of each accession was sequenced using Illumina technology
(90-bp reads, average coverage ~6× per accession) and mapped to the
eight chromosomes and two pseudomolecules (SNPs that could not be
mapped to a chromosome: T, tentative consensus sequences from the
Dana–Farber Cancer Institute M. truncatula gene index v10.0, and U,
unanchored BACs) of the M. truncatula v3.5 reference genome (Young
et al. 2011, www.medicagohapmap.org). Sequence data are available
from www.medicagohapmap.org. For analyses, we used 5.67 million
biallelic SNPs that were scored in ≥ 100 accessions.
Phenotype data (previously described in Stanton-Geddes et al. 2013)
were obtained from 226 accessions grown in a fully randomized eight-
block greenhouse experiment, with each accession replicated once per
block. Because M. truncatula is highly selfing in nature and accessions
were selfed for two or more generations prior to extracting DNA for
sequencing and conducting the experiment, replicates are expected to
be nearly genetically identical. Each plant was inoculated with two
strains of rhizobia (M249 and KH46c) that differ in nodulation
phenotypes (Sugawara et al. 2013). We recorded date of first flower and plant
height at 10 weeks after flowering and counted all nodules on each of
1899 surviving plants (6.1–8 plants per accession depending on the
trait) at 11 weeks after harvesting. For plants from 6 blocks, we
calculated the proportion of nodules occupied by rhizobia strain M249 in
the upper (top 5 cm) and lower roots using a dot-blot assay (Cregan,
Keyser & Sadowsky 1989). Trichome density was measured as the
number of trichomes visible at 10× magnification along a 2 mm section
of the petiole of 1 fully expanded leaf. For each trait, we calculated
clonal repeatability using linear mixed-effects models (LMMs)
implemented in the rptR package (Schielzeth & Nakagawa 2011) in R
version 2.15 (R Core Team 2013). For height, trichome density and
flowering date we used a Gaussian distribution with MCMC sampling
(rpt.mcmcLMM), while we used a GLMM with logit link and
multiplicative overdispersion for the nodule number (count data:
rpt.poisGLMM.multi) and nodule occupancy (proportion
data: rpt.binomGLMM.multi) traits. For total nodules and
nodule occupancy, repeatability is reported on the transformed (link)
scale. Results did not differ if we used an ANOVA or MCMC approach
(Stanton-Geddes 2013; Appendix 1). Clonal repeatability measures
the proportion of phenotypic variance that is among accessions and is equivalent to
broad-sense heritability, H2 (Lessells & Boag 1987; Nakagawa &
Schielzeth 2010). H2 is an upper bound on narrow-sense heritability (h2)
as it also includes effects due to dominance and epistasis (Falconer &
Mackay 1996; Lynch & Walsh 1998).
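The ANOVA route to repeatability mentioned above can be sketched in a few lines. This is a minimal illustration of the one-way ANOVA estimator described by Lessells & Boag (1987), not the rptR code used in the study; the function name and the toy flowering dates are hypothetical. The among-accession variance is (MS_among − MS_within) / n0, and repeatability is that variance divided by its sum with MS_within.

```python
def anova_repeatability(groups):
    """One-way ANOVA repeatability.

    groups: one list of replicate measurements per accession (hypothetical
    layout; unequal replicate numbers are handled through n0).
    """
    a = len(groups)                                  # number of accessions
    sizes = [len(g) for g in groups]
    total = sum(sizes)                               # number of plants
    grand_mean = sum(sum(g) for g in groups) / total
    means = [sum(g) / len(g) for g in groups]

    ss_among = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_among = ss_among / (a - 1)
    ms_within = ss_within / (total - a)

    # n0 is the effective number of replicates per accession.
    n0 = (total - sum(n * n for n in sizes) / total) / (a - 1)
    var_among = (ms_among - ms_within) / n0
    return var_among / (var_among + ms_within)


# Toy data: three accessions, two replicate plants each (days to flowering).
toy = [[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]]
r_toy = anova_repeatability(toy)
print(round(r_toy, 4))
```

With the toy data, MS_among = 200, MS_within = 2 and n0 = 2, so the estimate is 99/101. Because the among-accession variance is obtained by subtraction, the estimator can return negative values when replicates within an accession differ more than accessions do.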
We used the program GCTA v1.04 (Yang et al. 2011a) to estimate
the proportion of phenotypic variance explained by genotyped SNPs.
The GCTA analysis consists of two steps. First, all SNPs are used to
calculate the genetic relationship matrix (GRM) among accessions.
GCTA uses the accessions included in the analysis as the base popula-
tion for defining relatedness, such that the average relatedness between
all ‘unrelated’ pairs of accessions (off-diagonals of GRM) is zero (Pow-
ell, Visscher & Goddard 2010; Yang et al. 2010). As we are working
with an inbred species, the average relatedness of accessions with them-
selves (diagonals of GRM) is equal to two, not one as for humans
(Yang et al. 2010). The GRM is then used as a predictor in a mixed linear
model with a trait as the response to estimate h2GCTA. The GCTA
method estimates the proportion of phenotypic variance that is due to additive
genetic effects, and thus narrow-sense heritability, so h2GCTA should be lower than the H2
estimate of clonal repeatability. Scripts for GCTA analysis are available in
Appendix 2.
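As a rough illustration of the first step, the sketch below builds a relationship matrix from allele counts using the standardized cross-product A_jk = (1/N) Σ_i (x_ij − 2p_i)(x_ik − 2p_i) / (2p_i(1 − p_i)), which is our reading of the off-diagonal estimator in Yang et al. (2010). It is not the GCTA implementation: the function name, the simulated genotypes and the use of the same formula on the diagonal are assumptions made for illustration.

```python
import random


def genetic_relationship_matrix(genotypes):
    """genotypes[j][i] is the allele count (0, 1 or 2) of accession j at SNP i."""
    n = len(genotypes)
    n_snps = len(genotypes[0])
    grm = [[0.0] * n for _ in range(n)]
    used = 0
    for i in range(n_snps):
        column = [row[i] for row in genotypes]
        p = sum(column) / (2.0 * n)            # allele frequency in this sample
        if p <= 0.0 or p >= 1.0:
            continue                            # monomorphic SNPs are skipped
        scale = 2.0 * p * (1.0 - p)
        centred = [x - 2.0 * p for x in column]
        for j in range(n):
            for k in range(j + 1):              # lower triangle only
                grm[j][k] += centred[j] * centred[k] / scale
        used += 1
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    for j in range(n):
        for k in range(j + 1):
            grm[j][k] /= used
            grm[k][j] = grm[j][k]               # mirror to the upper triangle
    return grm


# Simulated, fully inbred accessions: every genotype is homozygous (0 or 2).
random.seed(1)
n_acc, n_snp = 20, 200
inbred = [[random.choice((0, 2)) for _ in range(n_snp)] for _ in range(n_acc)]
grm = genetic_relationship_matrix(inbred)

mean_diag = sum(grm[j][j] for j in range(n_acc)) / n_acc
mean_off = sum(grm[j][k] for j in range(n_acc) for k in range(n_acc)
               if j != k) / (n_acc * (n_acc - 1))
print(round(mean_diag, 6), round(mean_off, 6))
```

Because allele frequencies are taken from the sample itself, the mean diagonal for fully homozygous lines is 2 and the mean off-diagonal is −2/(n − 1), which is close to zero for a sample of 226 accessions.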
In unmanipulated populations, genotypes likely will be represented
by only a single individual. Therefore, we calculated h2GCTA separately
for plants from each of the experimental blocks, producing eight esti-
mates of heritability for height, trichomes, flowering date, total nodules
and six estimates for the nodule strain occupancy traits. Thus, each
accession is only included a single time in each GCTA analysis. For
comparison, we also calculated h2GCTA using accession means for each
trait that were estimated from a linear model that included block and
accession as fixed effects.
To evaluate how less complete genotyping data, that is, marker
density, influences estimates, we compared h2GCTA estimated from
the full data to h2GCTA estimates from 100 data sets of approxi-
mately 5 million, 250 000, 25 000 and 2500 randomly sampled
SNPs using PLINK (Purcell et al. 2007; http://pngu.mgh.harvard.edu/~purcell/plink/).
With 25 000 SNPs, there is on average ~1
SNP/10 kb of reference genome. Linkage disequilibrium, as mea-
sured by r2, is estimated to decay to background levels within
5–10 kb in M. truncatula (Branca et al. 2011). To evaluate how
h2GCTA estimates are affected by sample size (i.e. number of acces-
sions), we estimated heritability on 100 data sets comprised of 50,
100 or 150 randomly sampled accessions. To evaluate how h2GCTA
estimates are affected by sampling only common SNPs, we com-
pared h2GCTA estimated using the 2 178 524 SNPs with MAF
>10% to h2GCTA with an approximately equal number of SNPs
(2 155 724) sampled with MAF >2%. Finally, to evaluate whether
or not sampling putatively causative loci affects h2GCTA estimates,
we estimated the GRM using a pruned data set in which we
removed all SNPs in 10 kb windows flanking each of 1000 SNPs
that a genome-wide association study conducted with these same
data identified as having the strongest statistical association (lowest
P-values) with phenotypic variation in each trait (Stanton-Geddes
et al. 2013). The top 1000 SNPs are certain to contain many false
positives, that is, SNPs that are not actually responsible for pheno-
typic variation, but given that the GWAS was conducted with the
same data, these 1000 SNPs are those that have the strongest asso-
ciation with the phenotypes we analysed. We compared the h2GCTA
estimates using the GRM calculated from the causative SNP
pruned data to a GRM created by randomly masking 1000 10 kb
windows. For estimating the effects of excluding uncommon SNPs,
of sparsely sampling SNPs and not sampling putatively causative
SNPs, we compared h2GCTA estimates to the estimates from the full
data set using accession means. Scripts for statistical analysis are
available in Appendix 3.
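The three filtering operations described above (keeping SNPs above a MAF threshold, drawing a random subset, and masking windows around candidate SNPs) can be sketched as follows. This is a hypothetical illustration rather than the PLINK commands or the scripts in Appendix 3; the function names and toy positions are invented, and the sketch assumes that a "10 kb window" means 10 kb on either side of each candidate SNP on a single chromosome.

```python
import random
from bisect import bisect_left, bisect_right


def minor_allele_frequency(allele_counts):
    """MAF from per-accession counts (0, 1 or 2) of one allele at a biallelic SNP."""
    p = sum(allele_counts) / (2.0 * len(allele_counts))
    return min(p, 1.0 - p)


def random_subset(items, k, rng):
    """Draw k items without replacement, preserving the original order."""
    chosen = sorted(rng.sample(range(len(items)), k))
    return [items[i] for i in chosen]


def mask_windows(snp_positions, focal_positions, flank=10_000):
    """Drop every SNP lying within `flank` bp of a focal SNP (one chromosome)."""
    focal = sorted(focal_positions)
    kept = []
    for pos in snp_positions:
        lo = bisect_left(focal, pos - flank)
        hi = bisect_right(focal, pos + flank)
        if lo == hi:                 # no focal SNP inside [pos - flank, pos + flank]
            kept.append(pos)
    return kept


# Toy data: five SNP positions (bp) and one candidate SNP at 12 000 bp.
positions = [1_000, 9_000, 15_000, 25_000, 40_000]
kept = mask_windows(positions, focal_positions=[12_000])
is_common = minor_allele_frequency([0, 0, 0, 2]) > 0.10
subset = random_subset(positions, 3, random.Random(0))
print(kept, is_common, len(subset))
```

Sorting the candidate positions once and using binary search keeps the masking step fast even with millions of SNPs and 1000 candidates.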
Results
The relatedness of accessions (off-diagonals of GRM) calculated
by GCTA for the full data ranged from −0.26 to 2.49,
with the 95% confidence interval from −0.26 to 0.29 (Fig. S1).
This range of relatedness is an order of magnitude greater than
that found for 3925 ‘unrelated’ humans by Yang et al. (2010)
(95% CI from −0.027 to 0.027). The distribution of relatedness
is bimodal (Fig. S1), consistent with two major groups found
in previous analyses of population structure (Ronfort et al.
2006; Paape et al. 2013). Unlike Yang et al. (2010), we did not
remove closely related accessions since all accessions were
grown in common conditions, and thus, closely related individ-
uals are no more likely to share common environmental
conditions than distant relatives.
For the six phenotypes we analysed, clonal repeatabilities
(i.e. H2) estimated from replication of the accessions ranged
from near zero for nodule occupancy to 0.71 for flowering date
(Fig. 1, Table 1). For each of the traits, h2GCTA estimates using
all 5.6 million SNPs and phenotypes from each block individually
spanned the clonal repeatability estimates (Fig. 1). The
mean of the per-block estimates was highly correlated with the
clonal repeatability estimates (r = 0.98, P = 0.0008, Fig. 2).
However, estimates of h2GCTA from individual blocks showed
considerable variance around this mean estimate (Fig. 1).
Contrary to expectations that h2GCTA (narrow-sense h2) would
be lower than repeatability (broad-sense H2), all estimates of
repeatability were within the confidence interval for the mean
h2GCTA estimates and the slope of the regression equation did
not differ from one (0.93 ± 0.10). The intercept of the equation
did not differ from zero (−0.02 ± 0.04), indicating no bias
of h2GCTA compared to clonal repeatability (Fig. 2).
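The comparison of the slope with one and the intercept with zero can be checked with simple arithmetic. The sketch below assumes that the ± values are standard errors and that the regression was fitted to the six trait means (leaving four residual degrees of freedom); neither assumption is stated explicitly in the text, and the function name is hypothetical.

```python
# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_4DF = 2.776


def distance_in_standard_errors(estimate, null_value, standard_error):
    """How many standard errors separate an estimate from its null value."""
    return abs(estimate - null_value) / standard_error


slope_t = distance_in_standard_errors(0.93, 1.0, 0.10)       # slope versus one
intercept_t = distance_in_standard_errors(-0.02, 0.0, 0.04)  # intercept versus zero
print(slope_t < T_CRIT_4DF, intercept_t < T_CRIT_4DF)
```

Under these assumptions the slope lies about 0.7 standard errors from one and the intercept about 0.5 standard errors from zero, both well inside the critical value.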
Estimates of h2GCTA conducted with 250 000 or 25 000
SNPs were very similar to those obtained with the full data set
(Fig. 3). However, when only 2500 SNPs were used, h2GCTA
estimates were lower for all traits with non-zero estimates,
from 0.22 to 0.54 lower for flowering date and total nodules,
respectively (Fig. 3). The range of h2GCTA values in the resampled
data sets was also considerably larger when fewer SNPs
were used in the estimates (Fig. 3), increasing from 0.07 to 0.11
for flowering date and total nodules, respectively.
Although estimates were relatively robust to the number of
SNPs used, they were strongly affected by the number of acces-
sions – with the range of h2GCTA estimates increasing as the
number of accessions was reduced (Fig. 4). When only 50
accessions were used to estimate heritability, the values of
h2GCTA obtained from the resampled data sets ranged from 0
to 1, the entire range of possible values. Interestingly, the mean
h2GCTA estimates did not show a consistent change with smaller
samples. For flowering date, height and trichomes, the
mean h2GCTA estimates increased with decreasing number of
accessions, whereas for the nodulation traits the estimates
decreased with fewer accessions.
Estimates of h2GCTA from only common SNPs
(2 178 524 SNPs with MAF >10%, Table 1) were highly
correlated with h2GCTA estimated from a similar number of SNPs
that included uncommon SNPs (MAF > 2%; r = 0.99, P < 0.0001).
The estimates based on common SNPs alone were, however,
lower than those obtained from the full data set (average
reduction = 0.18 for the four non-zero traits, and all but 6
of the 52 per-block estimates were lower in the MAF >10%
estimates). In contrast, removing 10 kb windows of SNPs
that surround the 1000 SNPs that a previous GWAS identified
as most closely associated with phenotypic variance
in the data (Stanton-Geddes et al. 2013) had only minor
effects on h2GCTA estimates (Table 1). This result reinforces
that the GCTA method does not require causative SNPs to
be genotyped for accurate estimates of heritability, as long
as SNP density is high enough to accurately capture
fine-scale relatedness.
Discussion
Heritability is central to predicting the potential for a trait to
respond to selection. Unfortunately, estimating heritability
using traditional breeding or pedigree-based approaches is difficult
and may not even be possible for many organisms (Platenkamp
& Shaw 1995). Using traditional approaches for
estimating heritability is even more challenging if organisms
are grown or reared in natural settings, which may be impor-
tant given that heritability is environmentally dependent. Gen-
ome-scale sequence data provide an opportunity to estimate
relatedness of individuals using molecular data and then to use
estimated relatedness to infer heritability from the proportion
of phenotypic variance explained by genotyped SNPs (Yang
et al. 2010). Yang et al. (2010, 2011) evaluated the perfor-
mance of GCTA-based estimates of heritability of several
human traits using hundreds of thousands of SNPs and thou-
sands of individuals. The sample sizes that Yang et al. (2010)
considered are reasonable for researchers analysing human
data but are currently unobtainable in many other species. The
potential for these genomic-based approaches to estimate heri-
tability of non-model species will depend on the characteristics
of the molecular data, and the sample sizes needed to obtain
reliable estimates of heritability.
The good news for evolutionary ecologists interested in esti-
mating heritability of non-model species growing in natural
environments is that estimates of heritability obtained from
GCTA were positively correlated with estimates of clonal
repeatability (Figs 1, 2) and appear relatively robust to SNP
number (Fig. 3) and the exclusion of causative SNPs
(Table 1). However, for any single replicate, the h2GCTA could
be quite far from the mean (Fig. 1). For the six traits we examined,
h2GCTA estimates obtained with 25 000 SNPs (approximately
one SNP per 10 kb, a slightly greater distance
than that over which LD decays to background levels; Branca et al.
2011) were similar to h2GCTA estimates obtained from our full
5.6 million SNP data set. A simulation study similarly found
that only a few thousand markers are adequate to accurately
estimate heritability, particularly in selfing species such as
M. truncatula (Gay, Siol & Ronfort 2013). This result is
encouraging because it means that the number of SNPs needed
to obtain robust estimates of h2 can be assayed using reduced-
representation approaches such as RAD-tag or GBS. These
approaches are both less expensive and faster than full-genome
sequencing, especially for non-model species. However, our
data suggest that estimating heritability using only a few
thousand SNPs, a number that may be assayed using some
SNP-genotyping platforms, may produce unreliable estimates.
[Figure 1: x-axis, Trait (Flowering date, Height, Trichomes, Total nodules, U. nod occ, L. nod occ); y-axis, Heritability (0.00–0.75)]
Fig. 1. Plot of estimates of repeatability (r; black diamonds) and estimates
of h2GCTA from each block (+ signs) using full sequence data.
Standard errors are not shown for clarity but are reported in Table 1
(repeatability) and supplemental Data 3 (h2GCTA for each block).
Table 1. Estimates of clonal repeatability (r) and h2GCTA for six traits using ~2 million SNPs with minor allele frequency (MAF) >2% and >10%
and with 10 kb windows around the top 1000 SNPs masked (MAF > 2%). Standard errors for h2GCTA estimates are in parentheses
                                h2GCTA (S.E.)   h2GCTA (S.E.)   h2GCTA (S.E.)
Trait            r              MAF > 2%        MAF > 10%       Candidate SNPs masked
Trichomes        0.35 (0.03)    0.55 (0.22)     0.35 (0.15)     0.54 (0.21)
Flowering date   0.71 (0.02)    0.77 (0.14)     0.72 (0.11)     0.78 (0.14)
Height           0.52 (0.03)    0.71 (0.17)     0.64 (0.13)     0.70 (0.16)
Total nodules    0.31 (0.03)    0.91 (0.16)     0.84 (0.12)     0.92 (0.16)
U. nod occ       0.02 (0.03)    0.04 (0.08)     0.09 (0.09)     0.04 (0.08)
L. nod occ       0.01 (0.02)    0.01 (0.04)     0.02 (0.03)     0.01 (0.040)
Our data also indicate that it is not necessary to assay causa-
tive SNPs or even SNPs that are in close physical linkage with
causative SNPs to obtain reliable estimates of h2. The similarity
of estimates obtained with and without putatively causative
SNPs indicates that, with high-density genotyping, allelic variation
at the genotyped SNPs is sufficient to accurately capture relatedness (Goddard & Hayes
2009). Similarly, Ober et al. (2012) found that only 150 000
SNPs were necessary to capture the same predictive ability as
full sequence data in Drosophila, and Yang et al. (2010) were
able to capture about half of the heritability for human height
using only 294k SNPs. Simulation studies also have shown
that inclusion of causative SNPs has little effect on whole-
genome sequence-based phenotype prediction (e.g. genomic
prediction) – Meuwissen & Goddard (2010) showed that
including the causative SNPs yields only about a 3% increase
in prediction accuracy.
While GCTA-based estimates of heritability appear robust
to removal of causative SNPs and relatively robust to SNP
density, h2GCTA estimates are affected by assaying only com-
mon SNPs. We found that h2GCTA estimates that rely only on
common SNPs were lower than those obtained with the full
data set (Table 1). The lower estimates of h2GCTA obtained
when only common SNPs are used to estimate relatedness
likely reflect phenotypic differences among closely related
accessions that are not differentiated when low frequency
SNPs are excluded from the analyses. Lower estimates of h2
when assaying only common SNPs also are consistent with the
initial application of the GCTA method to human height
(Yang et al. 2010); using 250 k common frequency SNPs, they
explained ~ 45% of the phenotypic variance, and by making
assumptions about the MAF of ungenotyped causal SNPs,
they were able to explain the additional 35% to correspond
with independent estimates of the heritability. Despite lower
point estimates of h2GCTA, estimates made with common SNPs
(MAF > 10%) were tightly correlated with h2GCTA estimates
made using all SNPs, indicating that simple corrections, such
as used by Yang et al. (2010), may be adequate to estimate
heritability with only common SNPs. Moreover, if researchers are
primarily interested in the relative potential for different traits
to respond to selection, rather than absolute responses to
selection, then basing relatedness on only common SNPs
may be valid. From a practical perspective, many now-commonly
used genotyping methods, including RAD-tag and
GBS, do not require ascertainment of SNPs and thus are not
likely to be strongly biased towards common SNPs.
Although SNP density, MAF, and the inclusion of causative
SNPs had relatively small effects on h2 estimates, we found that
estimates were highly dependent on the number of accessions
sampled. When only 50 accessions were used, h2 estimates ranged
from 0 to 1 for all six traits (Fig. 4). Even with 100 accessions,
the range of values was large, with 50% quantiles of the
h2GCTA estimates spanning more than half of the possible
heritability values for trichome density and nodule number.
This range of values suggests that hundreds of individuals
should be used to obtain reliable estimates of h2. It is worth
noting that the need for large sample sizes is not limited to
GCTA: Villemereuil et al. (2013) showed with simulated data
that even when sampling 200 individuals, confidence intervals
around animal model-based estimates of h2 can be quite large,
with 95% quantiles covering a range of ~0.5 for traits with
moderate or high h2 (Villemereuil et al. 2013, Fig. 1).
[Figure 2: x-axis, Repeatability; y-axis, h2GCTA; fitted line y = −0.023 + 0.93x, r2 = 0.949]
Fig. 2. Relationship between the mean h2GCTA from all blocks and
repeatability estimated using all data for each trait. The regression
equation and r2 from the linear model of h2GCTA fit on repeatability are
shown.
[Figure 3: one panel per trait (Flo Date, Height, Trichomes, Tot nod, U nod occ, L nod occ); x-axis, # SNPs (2500, 25 000, 250 000, 5e+06); y-axis, h2GCTA (0.0–1.0)]
Fig. 3. Box-and-whisker plots of h2GCTA estimates from 100 samples
each made with 226 accessions and 2500, 25 000, 250 000 or 5 million
SNPs for flowering date (Flo Date), height, trichome density, total nodules
(Tot nod), upper roots rhizobia strain occupancy (U nod occ) and
lower roots rhizobia strain occupancy (L nod occ). The boxes give the
first and third quartiles, while the whiskers extend to the highest value
within 1.5 times the interquartile range.
In addition to the challenge of sampling hundreds of individ-
uals to obtain reliable estimates of h2, genomic-based estimates
of heritability in natural environments may be confounded by
the environment (Garant & Kruuk 2005). This was evident
even with data from plants grown in controlled greenhouse
conditions, with estimates of h2GCTA from individual blocks
differing from the overall mean by up to 0.24 (Fig. 1). This
problem will almost certainly be greater in natural settings. In
particular, if relatedness and the environment covary, as may
be expected with species that have limited dispersal, for which
there is a genetic basis to habitat choice (Bazzaz 1991), or for
which maternal environmental effects are strong, then h2GCTA
estimates are likely to be biased upwards. Yang et al. (2010)
suggested that the potential bias of shared environments might
be limited by removing closely related individuals that may
share a common environment. This approach has been criti-
cized for failing to account for fine-scale population structure
or ascertainment bias of samples (Browning & Browning
2011), though Goddard et al. (2011) emphasize this is a general
problem for genetic studies when the environment has a large
effect on the phenotype. An alternative approach, if important
environmental variables can be identified and assayed, would
be to directly include the relevant environmental variables as
covariates in the model estimating heritability.
In summary, to date, the estimation of heritability has
required a known pedigree in nature, or experimentally gener-
ated progeny reared in common environments. We find that
the GCTA method, which has been used to investigate heritability
of many human traits (Yang et al. 2010, 2011b; Lee et al.
2012), provides estimates of heritability that correspond to
independent estimates of clonal repeatability. These results
suggest that GBS-type approaches, which sample tens to hun-
dreds of thousands of SNPs (e.g. Baird et al. 2008; Andolfatto
et al. 2011; Elshire et al. 2011), will be a valuable resource for
estimating heritability in natural populations when many
hundreds of individuals can be sampled.
Acknowledgements
We thank Jian Yang for help with running the GCTA program, Ruth Shaw for
discussions, Jarrod Hadfield for comments, and Jean-Marie Prosperi, Joëlle
Ronfort, Magalie Delalande, Thierry Huguet, Laurent Gentzbittel and
Mohammed Badri for maintaining and providing Medicago germplasm. Tim
Paape, Roxanne Denny, Brendan Epstein, Stephanie Erlandson, Masayuki
Sugawara and Mohamed Yakub assisted with data collection. This work utilized
computing resources at the University of Minnesota Supercomputing Institute
and was funded by National Science Foundation Grant 0820005.
Data accessibility
Sequence data are available from: http://www.medicagohapmap.org/downloads/mt35.
Phenotype data are available from Data Dryad doi: 10.5061/dryad.pq143.
Scripts used to perform the analysis are available in online Appendices.
References
Andolfatto, P., Davison, D., Erezyilmaz, D., Hu, T.T., Mast, J., Sunayama-Morita, T. & Stern, D.L. (2011) Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Research, 21, 610–617.
Baird, N., Etter, P., Atwood, T. & Currey, M. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE, 3, 1–7.
Bazzaz, F.A. (1991) Habitat selection in plants. American Naturalist, 137, S116–S130.
Bonnin, I., Ronfort, J., Wozniak, F. & Olivieri, I. (2001) Spatial effects and rare outcrossing events in Medicago truncatula (Fabaceae). Molecular Ecology, 10, 1371–1383.
Branca, A., Paape, T.D., Zhou, P., Briskine, R., Farmer, A.D., Mudge, J. et al. (2011) Whole-genome nucleotide diversity, recombination, and linkage disequilibrium in the model legume Medicago truncatula. Proceedings of the National Academy of Sciences, USA, 108, e867–e867.
Browning, S.R. & Browning, B.L. (2011) Population structure can inflate SNP-based heritability estimates. American Journal of Human Genetics, 89, 191–193.
Campos, G. de los, Hickey, J.M., Pong-Wong, R., Daetwyler, H.D. & Calus, M.P.L. (2012) Whole genome regression and prediction methods applied to plant and animal breeding. Genetics, 193, 327–345.
Coltman, D.W. (2005) Testing marker-based estimates of heritability in the wild. Molecular Ecology, 14, 2593–2599.
Cregan, P., Keyser, H.H. & Sadowsky, M.J. (1989) Host plant effects on nodulation and competitiveness of the Bradyrhizobium japonicum serotype strains constituting serocluster 123. Applied and Environmental Microbiology, 55, 2532–2536.
Davies, G., Tenesa, A., Payton, A., Yang, J., Harris, S.E., Liewald, D. et al. (2011) Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Molecular Psychiatry, 16, 996–1005.
[Figure 4: six panels (Flo Date, Height, Trichomes, Tot nod, U nod occ, L nod occ); x-axis: number of accessions (50, 100, 150, 200); y-axis: h2GCTA (0.0 to 1.0).]
Fig. 4. Box-and-whisker plots of h2GCTA estimates from 100 samples
each made using all SNPs and with 50, 100, 150 or 200 accessions for
flowering date (Flo Date), height, trichome density, total nodules (Tot
nod), upper roots rhizobia strain occupancy (U nod occ) and lower
roots rhizobia strain occupancy (L nod occ). The boxes give the first
and third quartiles, while the whiskers extend to the highest value
within 1.5 times the interquartile range.
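The box-and-whisker convention described in the caption can be sketched as follows. This is a minimal sketch, assuming linearly interpolated quartiles (the NumPy default) and a lower whisker defined symmetrically to the upper one; the function name `box_whisker_summary` and the input values are made up for illustration and are not estimates from the study.

```python
import numpy as np

def box_whisker_summary(values):
    """Quartiles, whisker ends and outliers under the 1.5 x IQR rule."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": inside.min(),
        "upper_whisker": inside.max(),
        "outliers": x[(x < lower_fence) | (x > upper_fence)].tolist(),
    }

# Made-up heritability estimates, for illustration only.
estimates = [0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 1.0]
summary = box_whisker_summary(estimates)
```

For these values the upper whisker ends at 0.55 and the single estimate of 1.0 falls beyond the fence, so it would be drawn as a separate point. Plotting software may compute quartiles slightly differently, which can shift the box edges for small samples.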
© 2013 The Authors. Methods in Ecology and Evolution © 2013 British Ecological Society, Methods in Ecology and Evolution, 4, 1151–1158
Elshire, R.J., Glaubitz, J.C., Sun, Q., Poland, J.A., Kawamoto, K., Buckler, E.S. & Mitchell, S.E. (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE, 6, e19379–e19379.
Falconer, D.S. & Mackay, T.F.C. (1996) Introduction to Quantitative Genetics, 4th edn. Benjamin Cummings, Essex, England.
Fisher, R.A. (1918) The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52, 399–433.
Frentiu, F.D., Clegg, S.M., Chittock, J., Burke, T., Blows, M.W. & Owens, I.P.F. (2008) Pedigree-free animal models: the relatedness matrix reloaded. Proceedings of the Royal Society B: Biological Sciences, 275, 639–647.
Garant, D. & Kruuk, L.E.B. (2005) How to use molecular marker data to measure evolutionary parameters in wild populations. Molecular Ecology, 14, 1843–1859.
Gay, L., Siol, M. & Ronfort, J. (2013) Pedigree-free estimates of heritability in the wild: promising prospects for selfing populations. PLoS ONE, 8, e66983.
Geber, M.A. & Griffen, L. (2003) Inheritance and natural selection on functional traits. International Journal of Plant Sciences, 164, S21–S42.
Goddard, M.E. & Hayes, B.J. (2009) Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nature Reviews Genetics, 10, 381–391.
Goddard, M.E.M., Wray, N.R.R., Verbyla, K. & Visscher, P.M.P. (2009) Estimating effects and making predictions from genome-wide marker data. Statistical Science, 24, 517–529.
Goddard, M.E., Lee, S.H., Yang, J., Wray, N.R. & Visscher, P.M. (2011) Response to Browning and Browning. American Journal of Human Genetics, 89, 193–195.
Hill, W.G. (2010) Understanding and using quantitative genetic variation. Philosophical Transactions of the Royal Society B: Biological Sciences, 365, 73–85.
Jannink, J.-L., Lorenz, A.J. & Iwata, H. (2010) Genomic selection in plant breeding: from theory to practice. Briefings in Functional Genomics, 9, 166–177.
Lande, R. & Arnold, S.J. (1983) The measurement of selection on correlated characters. Evolution, 37, 1210–1226.
Lavergne, S., Moquet, N., Ronce, O. & Thuiller, W. (2010) Biodiversity and climate change: integrating evolutionary and ecological responses of species and communities. Annual Review of Ecology, Evolution, and Systematics, 41, 321–350.
Lee, S.H., DeCandia, T.R., Ripke, S., Yang, J., Sullivan, P.F., Goddard, M.E., Keller, M.C., Visscher, P.M. & Wray, N.R. (2012) Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nature Genetics, 44, 247–250.
Lessells, C.M. & Boag, P.T. (1987) Unrepeatable repeatabilities: a common mistake. The Auk, 104, 116–121.
Lynch, M. & Ritland, K. (1999) Estimation of pairwise relatedness with molecular markers. Genetics, 152, 1753–1766.
Lynch, M. & Walsh, B. (1998) Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, Massachusetts.
Meuwissen, T. & Goddard, M.E. (2010) Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics, 185, 623–631.
Meuwissen, T.H.E., Hayes, B.J. & Goddard, M.E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 1819–1829.
Moose, S.P., Dudley, J.W. & Rocheford, T.R. (2004) Maize selection passes the century mark: a unique resource for 21st century genomics. Trends in Plant Science, 9, 358–364.
Nakagawa, S. & Schielzeth, H. (2010) Repeatability for Gaussian and non-Gaussian data: a practical guide for biologists. Biological Reviews, 85, 935–956.
Ober, U., Ayroles, J.F., Stone, E.A., Richards, S., Zhu, D., Gibbs, R.A. et al. (2012) Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genetics, 8, e1002685–e1002685.
Paape, T., Bataillon, T., Zhou, P., Kono, T.J.Y., Briskine, R., Young, N.D. & Tiffin, P. (2013) Selection, genome-wide fitness effects and evolutionary rates in the model legume Medicago truncatula. Molecular Ecology, 22, 3525–3538.
Pemberton, J.M. (2008) Wild pedigrees: the way forward. Proceedings of the Royal Society B: Biological Sciences, 275, 613–621.
Platenkamp, G.A.J. & Shaw, R.G. (1995) Limits to adaptive population differentiation of quantitative traits in plants. Genecology and Ecogeographic Races (eds A.R. Kruckeberg, R.B. Walker & A.E. Leviton), pp. 143–167. Pacific Division AAAS, Ashland, Oregon.
Powell, J.E., Visscher, P.M. & Goddard, M.E. (2010) Reconciling the analysis of IBD and IBS in complex trait studies. Nature Reviews Genetics, 11, 800–805.
Purcell, S.M., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D. et al. (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81, 559–575.
R Core Team (2013) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org
Ritland, K. (1996) A marker-based method for inferences about quantitative inheritance in natural populations. Evolution, 50, 1062–1073.
Ritland, K. & Ritland, C. (1996) Inferences about quantitative inheritance based on natural population structure in the yellow monkeyflower, Mimulus guttatus. Evolution, 50, 1074–1082.
Robinson, M.R., Santure, A.W., De Cauwer, I., Sheldon, B.C. & Slate, J. (2013) Partitioning of genetic variation across the genome using multimarker methods in a wild bird population. Molecular Ecology, 22, 3963–3980.
Ronfort, J., Bataillon, T., Santoni, S., Delalande, M., David, J.L. & Prosperi, J.-M. (2006) Microsatellite diversity and broad scale geographic structure in a model legume: building a set of nested core collection for studying naturally occurring variation in Medicago truncatula. BMC Plant Biology, 6, 28.
Santure, A.W., De Cauwer, I., Robinson, M.R., Poissant, J., Sheldon, B.C. & Slate, J. (2013) Genomic dissection of variation in clutch size and egg mass in a wild great tit (Parus major) population. Molecular Ecology, 22, 3949–3962.
Schielzeth, H. & Nakagawa, S. (2011) rptR: Repeatability for Gaussian and non-Gaussian data. R package version 0.6.404. http://R-Forge.R-project.org/projects/rptr/
Shaw, R.G. & Etterson, J.R. (2012) Rapid climate change and the rate of adaptation: insight from experimental quantitative genetics. New Phytologist, 195, 752–765.
Shaw, R.G., Geyer, C.J., Wagenius, S., Hangelbroek, H.H. & Etterson, J.R. (2008) Unifying life-history analyses for inference of fitness and population growth. American Naturalist, 172, E35–E47.
Siol, M., Prosperi, J.M., Bonnin, I. & Ronfort, J. (2008) How multilocus genotypic pattern helps to understand the history of selfing populations: a case study in Medicago truncatula. Heredity, 100, 517–525.
Stanton-Geddes, J. (2013) Estimation of repeatability for Medicago truncatula phenotypic data. figshare. doi: 10.6084/m9.figshare.814544.
Stanton-Geddes, J., Paape, T., Epstein, B., Briskine, R., Yoder, J., Mudge, J. et al. (2013) Candidate genes and genetic architecture of symbiotic and agronomic traits revealed by whole-genome, sequence-based association genetics in Medicago truncatula. PLoS ONE, 8, e65688.
Stanton-Geddes, J., Yoder, J.B., Briskine, R., Young, N.D. & Tiffin, P. (2014) Data from: Estimating heritability using genomic data. Dryad Digital Repository. doi: 10.5061/dryad.pq143.
Sugawara, M., Epstein, B., Badgley, B., Unno, T., Xu, L., Reese, J. et al. (2013) Comparative genomics of the core and accessory genomes of 48 Sinorhizobium strains comprising five genospecies. Genome Biology, 14, R17.
VanRaden, P.M. (2008) Efficient methods to compute genomic predictions. Journal of Dairy Science, 91, 4414–4423.
Verweij, K.J.H., Yang, J., Lahti, J., Veijola, J., Hintsanen, M., Pulkki-Råback, L. et al. (2012) Maintenance of genetic variation in human personality: testing evolutionary models by estimating heritability due to common causal variants and investigating the effect of distant inbreeding. Evolution, 66, 3238–3251.
Villemereuil, P., Gimenez, O. & Doligez, B. (2013) Comparing parent-offspring regression with frequentist and Bayesian animal models to estimate heritability in wild populations: a simulation study for Gaussian and binary traits. Methods in Ecology and Evolution, 4, 260–275.
Weigensberg, I. & Roff, D. (1996) Natural heritabilities: can they be reliably estimated in the laboratory? Evolution, 50, 2149–2157.
Wright, S. (1920) The relative importance of heredity and environment in determining the piebald pattern of guinea-pigs. Proceedings of the National Academy of Sciences, USA, 6, 320–332.
Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R. et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42, 565–569.
Yang, J., Lee, S.H., Goddard, M.E. & Visscher, P.M. (2011a) GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics, 88, 76–82.
Yang, J., Manolio, T.A., Pasquale, L.R., Boerwinkle, E., Caporaso, N., Cunningham, J.M. et al. (2011b) Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics, 43, 519–525.
Young, N.D., Debellé, F., Oldroyd, G.E.D., Geurts, R., Cannon, S.B., Udvardi, M.K. et al. (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature, 480, 520–524.
Received 1 July 2013; accepted 19 September 2013
Handling Editor: Jarrod Hadfield
Supporting Information
Additional Supporting Information may be found in the online version
of this article.
Fig. S1. Histogram of relatedness values (off-diagonals of genetic relatedness
values calculated by GCTA) among accessions calculated from
all SNPs with MAF > 10% (left panel) and 2% (right panel).
Appendix S1. Estimation of repeatability for Medicago truncatula
phenotypic data.
Appendix S2. Shell scripts used for GCTA analysis with genotypic data
available from http://www.medicagohapmap.org/downloads/mt35.
Appendix S3. R scripts and data sets for analysis and figures.