Estimating heritability using genomic data

John Stanton-Geddes1*†, Jeremy B. Yoder1, Roman Briskine2, Nevin D. Young1,3 and Peter Tiffin1*

1Department of Plant Biology, University of Minnesota, Saint Paul, MN 55108, USA; 2Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA; and 3Department of Plant Pathology, University of Minnesota, Saint Paul, MN 55108, USA

Summary

1. Heritability (h2) represents the potential for short-term response of a quantitative trait to selection. Unfortu-

nately, estimating h2 through traditional crossing experiments is not practical for many species, and even for

those in which mating can be manipulated, it may not be possible to assay them in ecologically relevant environ-

ments.

2. We evaluated an approach, GCTA, that uses relatedness estimated from genomic data to estimate the proportion of phenotypic variance due to genotyped SNPs, which can be used to infer h2. Using phenotypic and genotypic data from eight replicates of experimentally grown plants of the annual legume Medicago truncatula, we

examined how h2 estimates from GCTA (h2GCTA) related to traditional estimates of heritability (clonal repeat-

ability for these inbred lines). Further, we examined how h2GCTA estimates were affected by SNP number, minor

allele frequency, the number of individuals assayed and the exclusion of causative SNPs.

3. We found that the average h2GCTA estimates for each trait made with the full data set (>5 million SNPs, 200 individuals) were strongly correlated (r = 0.99) with estimates of clonal repeatability. However, this result masks considerable variation among replicate estimates of h2GCTA, even in relatively uniform greenhouse conditions. h2GCTA estimates with 250 000 and 25 000 SNPs were very similar to those obtained with >5 million SNPs, but with 2500 SNPs, h2GCTA estimates were lower and had higher variance than those with ≥25 k SNPs. h2GCTA estimates were slightly lower when only common SNPs were used. Excluding putatively causative SNPs had little effect on the estimates of h2GCTA, suggesting that genotyping putatively causative SNPs is not necessary to obtain accurate estimates of h2. The number of accessions sampled had the greatest effect on h2GCTA estimates, and variance greatly increased as fewer accessions were included. With only 50 accessions sampled, h2GCTA estimates ranged from 0 to 1 for all traits.

4. These results indicate that the GCTA method may be useful for estimating h2 using data sets of a size that is available from reduced-representation genotyping but that hundreds of individuals may need to be sampled to obtain robust estimates of h2.

Key-words: Genomic selection, GWAS, heritability, quantitative genetics

Introduction

The vast majority of ecologically important traits are quantita-

tive or complex traits. For these traits, the genetic basis is too

complex to be untangled using traditional molecular genetic

approaches, although genome-wide association mapping is

being used to identify individual genes that contribute to the

variation. Given the inability to map phenotypic variation of

complex traits to the individual molecular determinants, statis-

tical approaches based on relatedness among individuals have

been used to disentangle the relative contributions of genetics

and environment to phenotypic variation (Fisher 1918;

Falconer & Mackay 1996) and to predict expected evolutionary responses to selection (Lande & Arnold 1983; Shaw et al.

2008). The effectiveness of formal statistical approaches for

predicting response to selection without explicitly identifying

the molecular mechanisms is demonstrated by the advances

achieved by plant and animal breeders in the past 100 years

(Moose, Dudley & Rocheford 2004).

A central parameter in estimating responses to selection and

summarizing the proportion of variance due to genetics is heritability (h2) (Wright 1920; Falconer & Mackay 1996; Lynch &

Walsh 1998; Hill 2010). Traditionally, h2 has been estimated

through pedigree analyses or using individuals of known rela-

tionships established by experimental crosses. Although rea-

sonable for short-lived species that can be reared in common

environments, experimental crosses are considerably more

challenging for species that are long lived, very large or are not

amenable to controlled mating. In addition, because heritabil-

ity is dependent upon the environment in which organisms are

reared, h2 estimates made under controlled conditions may not be good estimators of h2 in natural conditions (Geber & Griffen 2003), although a survey of estimates from animals suggests that laboratory and field estimates are often quite similar (Weigensberg & Roff 1996).

*Correspondence author. E-mails: [email protected]; ptiffin@umn.edu

†Current address: Department of Biology, University of Vermont, 102 Marsh Life Sciences, 109 Carrigan Dr, Burlington, VT 05405, USA

© 2013 The Authors. Methods in Ecology and Evolution © 2013 British Ecological Society

Methods in Ecology and Evolution 2013, 4, 1151–1158 doi: 10.1111/2041-210X.12129

Recognizing these limitations, Ritland and colleagues

(Ritland 1996; Ritland & Ritland 1996; Lynch & Ritland

1999) developed methods for estimating h2 in the field. These

methods were based on linear relationships between marker-

based estimates of relatedness and phenotypes. However,

because of uncertainty in estimates of relatedness and confounding of relatedness with the environment, h2 estimates

from these methods have not been accurate (Coltman 2005;

Frentiu et al. 2008; Pemberton 2008; Gay, Siol & Ronfort

2013). In recent years, multiple methods have been developed

in the animal breeding literature that use large-scale genomic

data to predict phenotypes (Meuwissen, Hayes & Goddard

2001; Van Raden 2008; Goddard et al. 2009; Campos et al.

2012) and estimate heritability based on the proportion of

phenotypic variance explained by genotyped SNPs. Many of

these methods have focused on phenotype prediction (i.e.

genomic selection or genomic prediction) with the goal of

being able to predict phenotypes in order to speed up breed-

ing cycles and save money required for phenotyping (God-

dard & Hayes 2009; Jannink, Lorenz & Iwata 2010). Yang

et al. (2010, 2011a) modified these methods in the GCTA

software package to estimate the additive genetic variance for

a trait using genome-scale single nucleotide polymorphism

(SNP) data. This method advances Ritland (1996) by estimat-

ing relatedness with many thousands of markers and then

using estimated relatedness to infer the additive genetic vari-

ance (hereafter referred to as h2GCTA) of a trait. If the assayed

SNPs adequately capture the relationships among individuals

at causative alleles, h2GCTA is equivalent to narrow-sense heri-

tability (Yang et al. 2010). Yang et al. (2010) evaluated their

method by estimating the proportion of variance explained

by ~290 k common SNPs in 3925 humans, and after correct-

ing for SNPs that are not genotyped and those at lower minor

allele frequency, came quite close to the heritability of height estimated from sib models, ~0.8. GCTA also has been used to

estimate heritability for human weight (Yang et al. 2011b),

intelligence (Davies et al. 2011), disease susceptibility (Lee

et al. 2012) and personality (Verweij et al. 2012). In addition,

similar methods have been used to partition genetic variation

in wing length (Robinson et al. 2013), clutch size and egg

mass of a wild bird population (Santure et al. 2013).

Genomic-based estimates of heritability, such as that imple-

mented in GCTA, together with the ability to collect genome-

scale polymorphism data through reduced-representation

genotyping such as RAD-tag, multiplexed shotgun genotyping

or genotype-by-sequencing (GBS) (Baird et al. 2008;

Andolfatto et al. 2011; Elshire et al. 2011), can make precise

estimates of heritability practical even for natural populations

of long-lived non-model species. Such estimates may be valu-

able for understanding evolution in natural populations and

predicting population responses to environmental perturba-

tions including ongoing climate change (Lavergne et al. 2010;

Shaw & Etterson 2012).

In this study, our overall objective was to use full-genome

sequence data and empirical estimates of heritability to evalu-

ate the effects of marker density and minor allele frequency,

sample size and the exclusion of causative SNPs on the perfor-

mance of the GCTA method. Our specific objectives were to

(i) compare h2 estimates obtained from replicated accessions

grown in a common environment with the h2GCTA estimates of

heritability, (ii) evaluate how estimates of h2GCTA are affected

by sample size, SNP density and minor allele frequency and

(iii) examine the effects excluding putatively causal genomic

regions has on h2GCTA estimates. We pursue these objectives

by subsampling a data set of ~6 million SNPs identified by genome sequencing (Branca et al. 2011; Stanton-Geddes et al.

2013) and phenotypic data for six traits (flowering time, height,

trichome density, total nodules and rhizobia strain occupancy

of nodules above and below 5 cm root growth) from 226 acces-

sions of M. truncatula. Our evaluation of the method extends

that of Yang et al. (2010) by empirically examining the effect

of including uncommon SNPs and by evaluating the performance of GCTA with SNP densities and sample sizes that may

be obtained by evolutionary ecologists working in non-model

systems.

Materials and methods

Medicago truncatula is highly selfing in nature (>95%; Bonnin et al. 2001; Siol et al. 2008) with a native range that extends around the Mediterranean (Ronfort et al. 2006). The genomic and phenotypic data analysed here, the collection of which is described in full in Stanton-Geddes et al. (2013), were obtained from 226 accessions of M. truncatula sampled from across a wide portion of the species range. In brief, the genome of each accession was sequenced using Illumina technology (90-bp reads, average coverage ~6× per accession) and mapped to the eight chromosomes and two pseudomolecules (SNPs that could not be mapped to a chromosome: T, tentative consensus sequences from the Dana–Farber Cancer Institute M. truncatula gene index v10.0; and U, unanchored BACs) of the M. truncatula v3.5 reference genome (Young et al. 2011, www.medicagohapmap.org). Sequence data are available from www.medicagohapmap.org. For analyses, we used 5.67 million biallelic SNPs that were scored in ≥ 100 accessions.

Phenotype data (previously described in Stanton-Geddes et al. 2013) were obtained from 226 accessions grown in a fully randomized eight-block greenhouse experiment, with each accession replicated once per block. Because M. truncatula is highly selfing in nature and accessions were selfed for two or more generations prior to extracting DNA for sequencing and conducting the experiment, replicates are expected to be nearly genetically identical. Each plant was inoculated with two strains of rhizobia (M249 and KH46c) that differ in nodulation phenotypes (Sugawara et al. 2013). We recorded date of first flower and plant height at 10 weeks after flowering and counted all nodules on each of 1899 surviving plants (6.1–8 plants per accession depending on the trait) at 11 weeks after harvesting. For plants from six blocks, we calculated the proportion of nodules occupied by rhizobia strain M249 in the upper (top 5 cm) and lower roots using a dot-blot assay (Cregan, Keyser & Sadowsky 1989). Trichome density was measured as the number of trichomes visible at 10× magnification along a 2 mm section of the petiole of one fully expanded leaf. For each trait, we calculated clonal repeatability using linear mixed-effects models (LMM) implemented in the rptR package (Schielzeth & Nakagawa 2011) in R version 2.15 (R Core Team 2013). For height, trichome density and flowering date we used a Gaussian distribution with MCMC sampling (rpt.mcmcLMM), while we used a GLMM with logit link and multiplicative overdispersion for the nodule number (count data: rpt.poisGLMM.multi) and nodule occupancy (proportion data: rpt.binomGLMM.multi) traits. For total nodules and nodule occupancy, repeatability is reported on the transformed (link) scale. Results did not differ if we used an ANOVA or MCMC approach (Stanton-Geddes 2013; Appendix 1). Clonal repeatability is a measure of the among-accession phenotypic variance and is equivalent to broad-sense heritability, H2 (Lessells & Boag 1987; Nakagawa & Schielzeth 2010). H2 is an upper bound on narrow-sense heritability (h2) as it also includes effects due to dominance and epistasis (Falconer & Mackay 1996; Lynch & Walsh 1998).
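To make the ANOVA route to repeatability concrete, the following is a minimal Python sketch of the variance-component estimator described by Lessells & Boag (1987). It is an illustration only and is not the rptR MCMC or GLMM analysis used here; the function name `anova_repeatability` and its arguments are hypothetical.

```python
import numpy as np

def anova_repeatability(values, groups):
    """One-way ANOVA repeatability (hypothetical helper, Gaussian traits only).

    values: one phenotype per plant; groups: accession label per plant.
    """
    values = np.asarray(values, dtype=float)
    levels, idx = np.unique(groups, return_inverse=True)
    a = levels.size                      # number of accessions
    n_i = np.bincount(idx)               # replicates per accession
    n_total = values.size
    group_means = np.bincount(idx, weights=values) / n_i
    grand_mean = values.mean()
    # Mean squares among and within accessions
    ms_among = np.sum(n_i * (group_means - grand_mean) ** 2) / (a - 1)
    ms_within = np.sum((values - group_means[idx]) ** 2) / (n_total - a)
    # n0 corrects for unequal replication; it equals n for balanced data
    n0 = (n_total - np.sum(n_i ** 2) / n_total) / (a - 1)
    s2_among = (ms_among - ms_within) / n0
    return s2_among / (s2_among + ms_within)
```

The estimate can be negative when the among-accession mean square is smaller than the within-accession mean square, which is usually read as repeatability indistinguishable from zero.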

We used the program GCTA v1.04 (Yang et al. 2011a) to estimate

the proportion of phenotypic variance explained by genotyped SNPs.

The GCTA analysis consists of two steps. First, all SNPs are used to

calculate the genetic relationship matrix (GRM) among accessions.

GCTA uses the accessions included in the analysis as the base popula-

tion for defining relatedness, such that the average relatedness between

all ‘unrelated’ pairs of accessions (off-diagonals of GRM) is zero (Pow-

ell, Visscher & Goddard 2010; Yang et al. 2010). As we are working

with an inbred species, the average relatedness of accessions with them-

selves (diagonals of GRM) is equal to two, not one as for humans

(Yang et al. 2010). The GRM is then used as a predictor in a mixed linear model with a trait as the response to estimate h2GCTA. The GCTA method estimates the proportion of additive genetic variance for a trait, and thus narrow-sense heritability, and so should be lower than the broad-sense estimate (H2) from clonal repeatability. Scripts for GCTA analysis are available in Appendix 2.
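The first GCTA step, building the GRM, can be sketched in Python as follows. This is a simplified illustration of the standardized cross-product form of the relationship estimator in Yang et al. (2010), assuming a complete genotype matrix of allele counts with no missing data. GCTA's own implementation, including its treatment of the diagonal and of missing genotypes, may differ, and the function name is hypothetical.

```python
import numpy as np

def genetic_relationship_matrix(genotypes):
    """Sketch of a GRM from an (individuals x SNPs) matrix of allele counts (0/1/2).

    Assumes no missing genotypes; monomorphic SNPs are dropped because
    they carry no information and would divide by zero.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                 # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)
    x, p = x[:, keep], p[keep]
    # Centre by 2p and scale by the expected SD under Hardy-Weinberg proportions
    z = (x - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / z.shape[1]              # average over SNPs
```

Because allele frequencies are taken from the sample itself, each row of this matrix sums to zero, so relatedness is expressed relative to the sampled accessions. For fully homozygous lines coded 0 or 2, the mean of the diagonal is 2 rather than 1, in line with the value for inbred accessions noted above.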

In unmanipulated populations, genotypes likely will be represented

by only a single individual. Therefore, we calculated h2GCTA separately

for plants from each of the experimental blocks, producing eight estimates of heritability for height, trichomes, flowering date and total nodules, and six estimates for the nodule strain occupancy traits. Thus, each

accession is only included a single time in each GCTA analysis. For

comparison, we also calculated h2GCTA using accession means for each

trait that were estimated from a linear model that included block and

accession as fixed effects.
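One way to obtain accession means from a linear model with block and accession as fixed effects is sketched below in Python. It is an assumption about how such means could be computed, not the script used for this study: the helper `adjusted_accession_means` is hypothetical, and it uses sum-to-zero coding for blocks so that each accession coefficient is the fitted value averaged equally over blocks.

```python
import numpy as np

def adjusted_accession_means(y, accession, block):
    """Least-squares accession means from y ~ accession + block (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    acc_levels, acc_idx = np.unique(accession, return_inverse=True)
    blk_levels, blk_idx = np.unique(block, return_inverse=True)
    n = y.size
    # One indicator column per accession (no separate intercept)
    xa = np.zeros((n, acc_levels.size))
    xa[np.arange(n), acc_idx] = 1.0
    # Sum-to-zero block contrasts: the last block is minus the sum of the others
    xb = np.zeros((n, blk_levels.size))
    xb[np.arange(n), blk_idx] = 1.0
    xb = xb[:, :-1] - xb[:, [-1]]
    beta, *_ = np.linalg.lstsq(np.hstack([xa, xb]), y, rcond=None)
    return {str(level): beta[i] for i, level in enumerate(acc_levels)}
```

With complete data these values equal the raw accession means; they differ only when some accession-by-block cells are missing, as happens when plants die.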

To evaluate how less complete genotyping data, that is, marker

density, influences estimates, we compared h2GCTA estimated from

the full data to h2GCTA estimates from 100 data sets of approxi-

mately 5 million, 250 000, 25 000 and 2500 randomly sampled

SNPs using PLINK (Purcell et al. 2007; http://pngu.mgh.harvard.

edu/~purcell/plink/). With 25 000 SNPs, there is on average ~1

SNP/10 kb of reference genome. Linkage disequilibrium, as mea-

sured by r2, is estimated to decay to background levels within

5–10 kb in M. truncatula (Branca et al. 2011). To evaluate how

h2GCTA estimates are affected by sample size (i.e. number of accessions), we estimated heritability on 100 data sets composed of 50,

100 or 150 randomly sampled accessions. To evaluate how h2GCTA

estimates are affected by sampling only common SNPs, we com-

pared h2GCTA estimated using the 2 178 524 SNPs with MAF

>10% to h2GCTA with an approximately equal number of SNPs

(2 155 724) sampled with MAF >2%. Finally, to evaluate whether

or not sampling putatively causative loci affects h2GCTA estimates,

we estimated the GRM using a pruned data set in which we

removed all SNPs in 10 kb windows flanking each of 1000 SNPs

that a genome-wide association study conducted with these same

data identified as having the strongest statistical association (lowest

P-values) with phenotypic variation in each trait (Stanton-Geddes

et al. 2013). The top 1000 SNPs are certain to contain many false

positives, that is, SNPs that are not actually responsible for pheno-

typic variation, but given that the GWAS was conducted with the

same data, these 1000 SNPs are those that have the strongest asso-

ciation with the phenotypes we analysed. We compared the h2GCTA

estimates using the GRM calculated from the causative SNP

pruned data to a GRM created by randomly masking 1000 10 kb

windows. For estimating the effects of excluding uncommon SNPs,

of sparsely sampling SNPs and not sampling putatively causative

SNPs, we compared h2GCTA estimates to the estimates from the full

data set using accession means. Scripts for statistical analysis are available in Appendix 3.
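The three kinds of SNP subsetting described above (random thinning, a minor allele frequency threshold and masking windows around focal SNPs) can be sketched in Python as below. The subsampling in this study was done with PLINK; the helper names here are hypothetical, and the window definition (all SNPs within 10 kb on either side of a focal SNP) is an assumption about how the flanking windows were drawn.

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """MAF per SNP from an (individuals x SNPs) matrix of allele counts (0/1/2)."""
    p = np.asarray(genotypes, dtype=float).mean(axis=0) / 2.0
    return np.minimum(p, 1.0 - p)

def random_snp_subset(n_snps, size, rng):
    """Sorted indices of `size` SNPs drawn without replacement."""
    return np.sort(rng.choice(n_snps, size=size, replace=False))

def mask_windows(chrom, pos, focal_idx, half_width=10_000):
    """Boolean keep-mask that is False for every SNP within `half_width` bp
    of a focal SNP on the same chromosome (assumed window definition)."""
    chrom = np.asarray(chrom)
    pos = np.asarray(pos)
    keep = np.ones(pos.size, dtype=bool)
    for i in focal_idx:
        in_window = (chrom == chrom[i]) & (np.abs(pos - pos[i]) <= half_width)
        keep &= ~in_window
    return keep
```

A common-SNP data set would then be selected with `minor_allele_frequency(g) > 0.10`, and a random-masking control is obtained by passing randomly chosen focal indices to `mask_windows`.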

Results

The relatedness of accessions (off-diagonals of GRM) calculated by GCTA for the full data ranged from −0.26 to 2.49, with the 95% confidence interval from −0.26 to 0.29 (Fig. S1). This range of relatedness is an order of magnitude greater than that found for 3925 ‘unrelated’ humans by Yang et al. (2010) (95% CI from −0.027 to 0.027). The distribution of relatedness is bimodal (Fig. S1), consistent with two major groups found

in previous analyses of population structure (Ronfort et al.

2006; Paape et al. 2013). Unlike Yang et al. (2010), we did not

remove closely related accessions since all accessions were

grown in common conditions, and thus, closely related individ-

uals are no more likely to share common environmental

conditions than distant relatives.

For the six phenotypes we analysed, clonal repeatabilities (i.e. broad-sense heritability, H2) estimated from replication of the accessions ranged from near zero for nodule occupancy to 0.71 for flowering date (Fig. 1, Table 1). For each of the traits, h2GCTA estimates using all 5.6 million SNPs and phenotypes from each block individually spanned the clonal repeatability estimates (Fig. 1). The mean of the per-block estimates was highly correlated with the clonal repeatability estimates (r = 0.98, P = 0.0008, Fig. 2). However, estimates of h2GCTA from individual blocks showed considerable variance around this mean estimate (Fig. 1). Contrary to expectations that h2GCTA (narrow-sense h2) would be lower than repeatability (broad-sense H2), all estimates of repeatability were within the confidence interval for the mean h2GCTA estimates and the slope of the regression equation did not differ from one (0.93 ± 0.10). The intercept of the equation did not differ from zero (−0.02 ± 0.04), indicating no bias of h2GCTA compared to clonal repeatability (Fig. 2).

Estimates of h2GCTA conducted with 250 000 or 25 000

SNPs were very similar to those obtained with the full data set

(Fig. 3). However, when only 2500 SNPs were used, h2GCTA

estimates were lower for all traits with non-zero estimates, from 0.22 to 0.54 lower for flowering date and total nodules, respectively (Fig. 3). The range of h2GCTA values in the resampled data sets was also considerably larger when fewer SNPs were used in the estimates (Fig. 3), increasing from 0.07 to 0.11 for flowering date and total nodules, respectively.

Although estimates were relatively robust to the number of

SNPs used, they were strongly affected by the number of acces-

sions – with the range of h2GCTA estimates increasing as the


number of accessions was reduced (Fig. 4). When only 50

accessions were used to estimate heritability, the values of

h2GCTA obtained from the resampled data sets ranged from 0

to 1, the entire range of possible values. Interestingly, the mean

h2GCTA estimates did not show a consistent change with smaller samples. For flowering date, height and trichomes, the

mean h2GCTA estimates increased with decreasing number of

accessions, whereas for the nodulation traits the estimates

decreased with fewer accessions.

Estimates of h2GCTA from only common SNPs (2 178 524 SNPs with MAF >10%, Table 1) were highly correlated with h2GCTA estimates from a similar number of SNPs that included uncommon SNPs (MAF > 2%; r = 0.99, P < 0.0001). The estimates based on common SNPs alone were, however, lower than those obtained from the full data set (average reduction = 0.18 for the four non-zero traits, and all but 6 of the 52 per-block estimates were lower in the MAF >10% estimates). In contrast, removing 10 kb windows of SNPs that surround the 1000 SNPs that a previous GWAS identified as most closely associated with phenotypic variance in the data (Stanton-Geddes et al. 2013) had only minor effects on h2GCTA estimates (Table 1). This result reinforces that the GCTA method does not require causative SNPs to be genotyped for accurate estimates of heritability, as long as SNP density is high enough to accurately capture fine-scale relatedness.

Discussion

Heritability is central to predicting the potential for a trait to

respond to selection. Unfortunately, estimating heritability

using traditional breeding or pedigree-based approaches is difficult and may not even be possible for many organisms (Platenkamp & Shaw 1995). Using traditional approaches for

estimating heritability is even more challenging if organisms

are grown or reared in natural settings, which may be important given that heritability is environmentally dependent. Genome-scale sequence data provide an opportunity to estimate relatedness of individuals using molecular data and then to use estimated relatedness to infer heritability from the proportion

of phenotypic variance explained by genotyped SNPs (Yang

et al. 2010). Yang et al. (2010, 2011) evaluated the perfor-

mance of GCTA-based estimates of heritability of several

human traits using hundreds of thousands of SNPs and thou-

sands of individuals. The sample sizes that Yang et al. (2010)

considered are reasonable for researchers analysing human

data but are currently unobtainable in many other species. The

potential for these genomic-based approaches to estimate heri-

tability of non-model species will depend on the characteristics

of the molecular data, and the sample sizes needed to obtain

reliable estimates of heritability.

The good news for evolutionary ecologists interested in esti-

mating heritability of non-model species growing in natural

environments is that estimates of heritability obtained from

GCTA were positively correlated with estimates of clonal

repeatability (Figs 1, 2) and appear relatively robust to SNP

number (Fig. 3) and the exclusion of causative SNPs

(Table 1). However, for any single replicate, the h2GCTA estimate could be quite far from the mean (Fig. 1). For the six traits we examined, h2GCTA estimates obtained with 25 000 SNPs (approximately one SNP per 10 kb, a slightly greater distance than that over which LD decays to background levels; Branca et al. 2011) were similar to h2GCTA estimates obtained from our full

[Figure 1: heritability (y-axis, 0.00 to 0.75) by trait (x-axis: Flowering date, Height, Trichomes, Total nodules, U. nod occ, L. nod occ)]

Fig. 1. Plot of estimates of repeatability (r; black diamonds) and estimates of h2GCTA from each block (+ signs) using full sequence data. Standard errors are not shown for clarity but are reported in Table 1 (repeatability) and supplemental Data 3 (h2GCTA for each block).

Table 1. Estimates of clonal repeatability (r) and h2GCTA for six traits using ~2 million SNPs with minor allele frequency (MAF) >2% and >10% and with 10 kb windows around the top 1000 SNPs masked (MAF > 2%). Standard errors for h2GCTA estimates are in parentheses

Trait            r             h2GCTA (S.E.)   h2GCTA (S.E.)   h2GCTA (S.E.)
                               MAF > 2%        MAF > 10%       Candidate SNPs masked
Trichomes        0.35 (0.03)   0.55 (0.22)     0.35 (0.15)     0.54 (0.21)
Flowering date   0.71 (0.02)   0.77 (0.14)     0.72 (0.11)     0.78 (0.14)
Height           0.52 (0.03)   0.71 (0.17)     0.64 (0.13)     0.70 (0.16)
Total nodules    0.31 (0.03)   0.91 (0.16)     0.84 (0.12)     0.92 (0.16)
U. nod occ       0.02 (0.03)   0.04 (0.08)     0.09 (0.09)     0.04 (0.08)
L. nod occ       0.01 (0.02)   0.01 (0.04)     0.02 (0.03)     0.01 (0.040)



5.6 million SNP data set. A simulation study similarly found

that only a few thousand markers are adequate to accurately

estimate heritability, particularly in selfing species such as

M. truncatula (Gay, Siol & Ronfort 2013). This result is

encouraging because it means that the number of SNPs needed

to obtain robust estimates of h2 can be assayed using reduced-

representation approaches such as RAD-tag or GBS. These

approaches are both less expensive and faster than full-genome

sequencing, especially for non-model species. However, our

data suggest that estimating heritability using only a few

thousand SNPs, a number that may be assayed using some

SNP-genotyping platforms, may produce unreliable estimates.

Our data also indicate that it is not necessary to assay causa-

tive SNPs or even SNPs that are in close physical linkage with

causative SNPs to obtain reliable estimates of h2. The similarity

of estimates obtained with and without putatively causative

SNPs indicates that with high-density genotyping, allelic variation at the genotyped SNPs is sufficient to accurately capture relatedness (Goddard & Hayes

2009). Similarly, Ober et al. (2012) found that only 150 000

SNPs were necessary to capture the same predictive ability as

full sequence data in Drosophila, and Yang et al. (2010) were

able to capture about half of the heritability for human height

using only 294k SNPs. Simulation studies also have shown

that inclusion of causative SNPs has little effect on whole-

genome sequence-based phenotype prediction (e.g. genomic

prediction) – Meuwissen & Goddard (2010) showed that

including the causative SNPs yields only about a 3% increase

in prediction accuracy.

While GCTA-based estimates of heritability appear robust

to removal of causative SNPs and relatively robust to SNP

density, h2GCTA estimates are affected by assaying only com-

mon SNPs. We found that h2GCTA estimates that rely only on

common SNPs were lower than those obtained with the full

data set (Table 1). The lower estimates of h2GCTA obtained

when only common SNPs are used to estimate relatedness

likely reflects phenotypic differences among closely related

accessions that are not differentiated when low frequency

SNPs are excluded from the analyses. Lower estimates of h2

when assaying only common SNPs also are consistent with the

initial application of the GCTA method to human height

(Yang et al. 2010); using 250 k common frequency SNPs, they

explained ~ 45% of the phenotypic variance, and by making

assumptions about the MAF of ungenotyped causal SNPs,

they were able to explain the additional 35% to correspond

with independent estimates of the heritability. Despite lower

point estimates of h2GCTA, estimates made with common SNPs (MAF > 10%) were tightly correlated with h2GCTA estimates made using all SNPs, indicating that simple corrections, such as used by Yang et al. (2010), may be adequate to estimate heritability with only common SNPs. Moreover, if researchers are primarily interested in the relative potential for different traits to respond to selection, rather than absolute responses to selection, then basing relatedness on only common SNPs may be valid. From a practical perspective, many now-commonly used genotyping methods, including RAD-tag and

GBS, do not require ascertainment of SNPs and thus are not

likely to be strongly biased towards common SNPs.

Although SNP density, MAF and the inclusion of causative SNPs had relatively small effects on h2 estimates, we found that estimates were highly dependent on the number of accessions sampled. When only 50 accessions were used, h2 estimates ranged from 0 to 1 for all six traits (Fig. 4). Even with 100 accessions, the range of values was large, with 50% quantiles of the h2GCTA estimates spanning more than half of the possible heritability values for trichome density and nodule number. This range of values suggests that hundreds of individuals should be used to obtain reliable estimates of h2. It is worth noting that the need for large sample sizes is not limited to GCTA: Villemereuil et al. (2013) showed with simulated data that even when sampling 200 individuals, confidence intervals around animal model-based estimates of h2 can be quite large, with 95% quantiles covering a range of ~0.5 for traits with moderate or high h2 (Villemereuil et al. 2013, Fig. 1).

[Figure 2: scatterplot of h2GCTA (y-axis) against repeatability (x-axis), with fitted line y = −0.023 + 0.93x, r2 = 0.949]

Fig. 2. Relationship between the mean h2GCTA from all blocks and repeatability estimated using all data for each trait. The regression equation and r2 from the linear model of h2GCTA fit on repeatability are shown.

[Figure 3: panels for Flo Date, Height, Trichomes, Tot nod, U nod occ and L nod occ; x-axis, # SNPs (2500, 25 000, 250 000, 5e+06); y-axis, h2GCTA (0 to 1)]

Fig. 3. Box-and-whisker plots of h2GCTA estimates from 100 samples each made with 226 accessions and 2500, 25 000, 250 000 or 5 million SNPs for flowering date (Flo Date), height, trichome density, total nodules (Tot nod), upper roots rhizobia strain occupancy (U nod occ) and lower roots rhizobia strain occupancy (L nod occ). The boxes give the first and third quartiles, while the whiskers extend to the highest value within 1.5 times the interquartile range.

In addition to the challenge of sampling hundreds of individ-

uals to obtain reliable estimates of h2, genomic-based estimates

of heritability in natural environments may be confounded by

the environment (Garant & Kruuk 2005). This was evident

even with data from plants grown in controlled greenhouse

conditions, with estimates of h2GCTA from individual blocks differing from the overall mean by up to 0.24 (Fig. 1). This

problem will almost certainly be greater in natural settings. In

particular, if relatedness and the environment covary, as may

be expected with species that have limited dispersal, for which

there is a genetic basis to habitat choice (Bazzaz 1991), or for

which maternal environmental effects are strong, then h2GCTA

estimates are likely to be biased upwards. Yang et al. (2010)

suggested that the potential bias of shared environments might

be limited by removing closely related individuals that may

share a common environment. This approach has been criti-

cized for failing to account for fine-scale population structure

or ascertainment bias of samples (Browning & Browning

2011), though Goddard et al. (2011) emphasize this is a general

problem for genetic studies when the environment has a large

effect on the phenotype. An alternative approach, if important

environmental variables can be identified and assayed, would

be to directly include the relevant environmental variables as

covariates in the model estimating heritability.

In summary, to date, the estimation of heritability has

required a known pedigree in nature, or experimentally gener-

ated progeny reared in common environments. We find that

the GCTA method, which has been used to investigate heritability of many human traits (Yang et al. 2010, 2011b; Lee et al.

2012), provides estimates of heritability that correspond to

independent estimates of clonal repeatability. These results

suggest that GBS-type approaches, which sample tens to hun-

dreds of thousands of SNPs (e.g. Baird et al. 2008; Andolfatto

et al. 2011; Elshire et al. 2011), will be a valuable resource for

estimating heritability in natural populations when many

hundreds of individuals can be sampled.

Acknowledgements

We thank Jian Yang for help with running the GCTA program, Ruth Shaw for discussions, Jarrod Hadfield for comments, and Jean-Marie Prosperi, Joëlle Ronfort, Magalie Delalande, Thierry Huguet, Laurent Gentzbittel and Mohammed Badri for maintaining and providing Medicago germplasm. Tim Paape, Roxanne Denny, Brendan Epstein, Stephanie Erlandson, Masayuki Sugawara and Mohamed Yakub assisted with data collection. This work utilized computing resources at the University of Minnesota Supercomputing Institute and was funded by National Science Foundation Grant 0820005.

Data accessibility

Sequence data are available from: http://www.medicagohapmap.org/downloads/mt35. Phenotype data are available from Data Dryad doi: 10.5061/dryad.pq143. Scripts used to perform the analysis are available in online Appendices.

References

Andolfatto, P., Davison, D., Erezyilmaz, D., Hu, T.T., Mast, J., Sunayama-Morita, T. & Stern, D.L. (2011) Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Research, 21, 610–617.

Baird, N., Etter, P., Atwood, T. & Currey, M. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE, 3, 1–7.

Bazzaz, F.A. (1991) Habitat selection in plants. American Naturalist, 137, S116–S130.

Bonnin, I., Ronfort, J., Wozniak, F. & Olivieri, I. (2001) Spatial effects and rare outcrossing events in Medicago truncatula (Fabaceae). Molecular Ecology, 10, 1371–1383.

Branca, A., Paape, T.D., Zhou, P., Briskine, R., Farmer, A.D., Mudge, J. et al. (2011) Whole-genome nucleotide diversity, recombination, and linkage disequilibrium in the model legume Medicago truncatula. Proceedings of the National Academy of Sciences, USA, 108, e867–e867.

Browning, S.R. & Browning, B.L. (2011) Population structure can inflate SNP-based heritability estimates. American Journal of Human Genetics, 89, 191–193.

Campos, G. de los, Hickey, J.M., Pong-Wong, R., Daetwyler, H.D. & Calus, M.P.L. (2012) Whole genome regression and prediction methods applied to plant and animal breeding. Genetics, 193, 327–345.

Coltman, D.W. (2005) Testing marker-based estimates of heritability in the wild. Molecular Ecology, 14, 2593–2599.

Cregan, P., Keyser, H.H. & Sadowsky, M.J. (1989) Host plant effects on nodulation and competitiveness of the Bradyrhizobium japonicum serotype strains constituting serocluster 123. Applied and Environmental Microbiology, 55, 2532–2536.

Davies, G., Tenesa, A., Payton, A., Yang, J., Harris, S.E., Liewald, D. et al. (2011) Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Molecular Psychiatry, 16, 996–1005.

Fig. 4. Box-and-whisker plots of h2GCTA estimates (y-axis) from 100 samples each made using all SNPs and with 50, 100, 150 or 200 accessions (x-axis) for flowering date (Flo Date), height, trichome density, total nodules (Tot nod), upper roots rhizobia strain occupancy (U nod occ) and lower roots rhizobia strain occupancy (L nod occ). The boxes give the first and third quartiles, while the whiskers extend to the highest value within 1.5 times the interquartile range.



Elshire, R.J., Glaubitz, J.C., Sun, Q., Poland, J.A., Kawamoto, K., Buckler, E.S. & Mitchell, S.E. (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE, 6, e19379–e19379.

Falconer, D.S. & Mackay, T.F.C. (1996) Introduction to Quantitative Genetics, 4th edn. Benjamin Cummings, Essex, England.

Fisher, R.A. (1918) The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52, 399–433.

Frentiu, F.D., Clegg, S.M., Chittock, J., Burke, T., Blows, M.W. & Owens, I.P.F. (2008) Pedigree-free animal models: the relatedness matrix reloaded. Proceedings of the Royal Society B: Biological Sciences, 275, 639–647.

Garant, D. & Kruuk, L.E.B. (2005) How to use molecular marker data to measure evolutionary parameters in wild populations. Molecular Ecology, 14, 1843–1859.

Gay, L., Siol, M. & Ronfort, J. (2013) Pedigree-free estimates of heritability in the wild: promising prospects for selfing populations. PLoS ONE, 8, e66983.

Geber, M.A. & Griffen, L. (2003) Inheritance and natural selection on functional traits. International Journal of Plant Sciences, 164, S21–S42.

Goddard, M.E. & Hayes, B.J. (2009) Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nature Reviews Genetics, 10, 381–391.

Goddard, M.E.M., Wray, N.R.R., Verbyla, K. & Visscher, P.M.P. (2009) Estimating effects and making predictions from genome-wide marker data. Statistical Science, 24, 517–529.

Goddard, M.E., Lee, S.H., Yang, J., Wray, N.R. & Visscher, P.M. (2011) Response to Browning and Browning. American Journal of Human Genetics, 89, 193–195.

Hill, W.G. (2010) Understanding and using quantitative genetic variation. Philosophical Transactions of the Royal Society B: Biological Sciences, 365, 73–85.

Jannink, J.-L., Lorenz, A.J. & Iwata, H. (2010) Genomic selection in plant breeding: from theory to practice. Briefings in Functional Genomics, 9, 166–177.

Lande, R. & Arnold, S.J. (1983) The measurement of selection on correlated characters. Evolution, 37, 1210–1226.

Lavergne, S., Moquet, N., Ronce, O. & Thuiller, W. (2010) Biodiversity and climate change: integrating evolutionary and ecological responses of species and communities. Annual Review of Ecology, Evolution, and Systematics, 41, 321–350.

Lee, S.H., DeCandia, T.R., Ripke, S., Yang, J., Sullivan, P.F., Goddard, M.E., Keller, M.C., Visscher, P.M. & Wray, N.R. (2012) Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nature Genetics, 44, 247–250.

Lessells, C.M. & Boag, P.T. (1987) Unrepeatable repeatabilities: a common mistake. The Auk, 104, 116–121.

Lynch, M. & Ritland, K. (1999) Estimation of pairwise relatedness with molecular markers. Genetics, 152, 1753–1766.

Lynch, M. & Walsh, B. (1998) Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, Massachusetts.

Meuwissen, T. & Goddard, M.E. (2010) Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics, 185, 623–631.

Meuwissen, T.H.E., Hayes, B.J. & Goddard, M.E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 1819–1829.

Moose, S.P., Dudley, J.W. & Rocheford, T.R. (2004) Maize selection passes the century mark: a unique resource for 21st century genomics. Trends in Plant Science, 9, 358–364.

Nakagawa, S. & Schielzeth, H. (2010) Repeatability for Gaussian and non-Gaussian data: a practical guide for biologists. Biological Reviews, 85, 935–956.

Ober, U., Ayroles, J.F., Stone, E.A., Richards, S., Zhu, D., Gibbs, R.A. et al. (2012) Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genetics, 8, e1002685–e1002685.

Paape, T., Bataillon, T., Zhou, P., Kono, T.J.Y., Briskine, R., Young, N.D. & Tiffin, P. (2013) Selection, genome-wide fitness effects and evolutionary rates in the model legume Medicago truncatula. Molecular Ecology, 22, 3525–3538.

Pemberton, J.M. (2008) Wild pedigrees: the way forward. Proceedings of the Royal Society B: Biological Sciences, 275, 613–621.

Platenkamp, G.A.J. & Shaw, R.G. (1995) Limits to adaptive population differentiation of quantitative traits in plants. Genecology and Ecogeographic Races (eds A.R. Kruckeberg, R.B. Walker & A.E. Leviton), pp. 143–167. Pacific Division AAAS, Ashland, Oregon.

Powell, J.E., Visscher, P.M. & Goddard, M.E. (2010) Reconciling the analysis of IBD and IBS in complex trait studies. Nature Reviews Genetics, 11, 800–805.

Purcell, S.M., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D. et al. (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81, 559–575.

R Core Team (2013) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org

Ritland, K. (1996) A marker-based method for inferences about quantitative inheritance in natural populations. Evolution, 50, 1062–1073.

Ritland, K. & Ritland, C. (1996) Inferences about quantitative inheritance based on natural population structure in the yellow monkeyflower, Mimulus guttatus. Evolution, 50, 1074–1082.

Robinson, M.R., Santure, A.W., De Cauwer, I., Sheldon, B.C. & Slate, J. (2013) Partitioning of genetic variation across the genome using multimarker methods in a wild bird population. Molecular Ecology, 22, 3963–3980.

Ronfort, J., Bataillon, T., Santoni, S., Delalande, M., David, J.L. & Prosperi, J.-M. (2006) Microsatellite diversity and broad scale geographic structure in a model legume: building a set of nested core collection for studying naturally occurring variation in Medicago truncatula. BMC Plant Biology, 6, 28.

Santure, A.W., De Cauwer, I., Robinson, M.R., Poissant, J., Sheldon, B.C. & Slate, J. (2013) Genomic dissection of variation in clutch size and egg mass in a wild great tit (Parus major) population. Molecular Ecology, 22, 3949–3962.

Schielzeth, H. & Nakagawa, S. (2011) rptR: Repeatability for Gaussian and non-Gaussian data. R package version 0.6.404. http://R-Forge.R-project.org/projects/rptr/

Shaw, R.G. & Etterson, J.R. (2012) Rapid climate change and the rate of adaptation: insight from experimental quantitative genetics. New Phytologist, 195, 752–765.

Shaw, R.G., Geyer, C.J., Wagenius, S., Hangelbroek, H.H. & Etterson, J.R. (2008) Unifying life-history analyses for inference of fitness and population growth. American Naturalist, 172, E35–E47.

Siol, M., Prosperi, J.M., Bonnin, I. & Ronfort, J. (2008) How multilocus genotypic pattern helps to understand the history of selfing populations: a case study in Medicago truncatula. Heredity, 100, 517–525.

Stanton-Geddes, J. (2013) Estimation of repeatability for Medicago truncatula phenotypic data. figshare. doi: 10.6084/m9.figshare.814544.

Stanton-Geddes, J., Paape, T., Epstein, B., Briskine, R., Yoder, J., Mudge, J. et al. (2013) Candidate genes and genetic architecture of symbiotic and agronomic traits revealed by whole-genome, sequence-based association genetics in Medicago truncatula. PLoS ONE, 8, e65688.

Stanton-Geddes, J., Yoder, J.B., Briskine, R., Young, N.D. & Tiffin, P. (2014) Data from: Estimating heritability using genomic data. Dryad Digital Repository. doi: 10.5061/dryad.pq143.

Sugawara, M., Epstein, B., Badgley, B., Unno, T., Xu, L., Reese, J. et al. (2013) Comparative genomics of the core and accessory genomes of 48 Sinorhizobium strains comprising five genospecies. Genome Biology, 14, R17.

VanRaden, P.M. (2008) Efficient methods to compute genomic predictions. Journal of Dairy Science, 91, 4414–4423.

Verweij, K.J.H., Yang, J., Lahti, J., Veijola, J., Hintsanen, M., Pulkki-Råback, L. et al. (2012) Maintenance of genetic variation in human personality: testing evolutionary models by estimating heritability due to common causal variants and investigating the effect of distant inbreeding. Evolution, 66, 3238–3251.

Villemereuil, P., Gimenez, O. & Doligez, B. (2013) Comparing parent-offspring regression with frequentist and Bayesian animal models to estimate heritability in wild populations: a simulation study for Gaussian and binary traits. Methods in Ecology and Evolution, 4, 260–275.

Weigensberg, I. & Roff, D. (1996) Natural heritabilities: can they be reliably estimated in the laboratory? Evolution, 50, 2149–2157.

Wright, S. (1920) The relative importance of heredity and environment in determining the piebald pattern of guinea-pigs. Proceedings of the National Academy of Sciences, USA, 6, 320–332.

Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R. et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42, 565–569.

Yang, J., Lee, S.H., Goddard, M.E. & Visscher, P.M. (2011a) GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics, 88, 76–82.

Yang, J., Manolio, T.A., Pasquale, L.R., Boerwinkle, E., Caporaso, N., Cunningham, J.M. et al. (2011b) Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics, 43, 519–525.

Young, N.D., Debellé, F., Oldroyd, G.E.D., Geurts, R., Cannon, S.B., Udvardi, M.K. et al. (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature, 480, 520–524.

Received 1 July 2013; accepted 19 September 2013

Handling Editor: Jarrod Hadfield



Supporting Information

Additional Supporting Information may be found in the online version

of this article.

Fig. S1. Histogram of relatedness values (off-diagonals of genetic relatedness values calculated by GCTA) among accessions calculated from all SNPs with MAF > 10% (left panel) and 2% (right panel).

Appendix S1. Estimation of repeatability for Medicago truncatula

phenotypic data.

Appendix S2. Shell scripts used for GCTA analysis with genotypic data available from http://www.medicagohapmap.org/downloads/mt35.

Appendix S3. R scripts and data sets for analysis and figures.


