The effect of genetic relationships and other factors on genomic prediction accuracy in public plant breeding programs
Aaron Lorenz NGGIBCI-2014
ICRISAT, Feb 21, 2014
Jon Luetchens: Drought phenotyping (Corn breeding)
Liakat Ali: Drought phenotyping (Physiology/genetics)
Collin Lamkey: Heterotic groups (Corn breeding/QG)
Nonoy Bandillo: Time series GWAS (GWAS methods)
Amritpal Singh: Goss’s wilt (GWAS/Genome analysis)
Dnyaneshwar Kadam: Hybrid prediction (Corn breeding/Genomic prediction)
Ibrahim El-Basyoni: Winter wheat (Genomic prediction)
Diego Jarquin: Statistics (Soybean genomic pred.)
Plant breeding in the 21st century: two important trends
[Figure: genotypic data and phenotypic data, each marked with a "$" symbol]
Genomic selection
DNA marker data
Phenotypic data
y = Xb + Zu + e
Model training
Predict and select
Selection candidates
Training Population
Calibration Set
• No QTL mapping
• No testing for significant markers
Estimation methods for genomic selection
1. Shrinkage models
   • RR-BLUP, BayesA
2. Dimension reduction methods
   • Partial least squares
   • Principal component regression
3. Variable selection models
   • BayesB, BayesCπ, BayesDπ
4. Kernel and machine learning methods
   • Support vector machine regression
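As an illustration of the shrinkage family, here is a minimal sketch of RR-BLUP for the model y = Xb + Zu + e. It assumes the only fixed effect is an overall mean and that the ridge parameter `lam` (the ratio of residual to marker-effect variance) is already known; in practice that ratio would be estimated, for example by REML. The function names are illustrative and do not come from the talk.

```python
import numpy as np

def rrblup_fit(Z, y, lam):
    """Ridge-regression BLUP of marker effects (sketch).

    Z   : (n, p) marker matrix, one row per line
    y   : (n,) phenotypes
    lam : ridge penalty, assumed known (sigma_e^2 / sigma_u^2)
    """
    mu = y.mean()
    col_means = Z.mean(axis=0)
    Zc = Z - col_means                      # centre markers
    n = Zc.shape[0]
    # Dual form: u = Zc' (Zc Zc' + lam I)^-1 (y - mu).
    # Only an n x n system is solved, which is cheap when p >> n.
    alpha = np.linalg.solve(Zc @ Zc.T + lam * np.eye(n), y - mu)
    u = Zc.T @ alpha
    return mu, col_means, u

def rrblup_predict(model, Z_new):
    """Predicted genetic values (GEBV plus mean) for new lines."""
    mu, col_means, u = model
    return mu + (Z_new - col_means) @ u
```

Solving in the dual (n x n) form reflects the "large p, smaller n" setting typical of genomic selection data.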
Training population
Line     Yield   Mrk 1   Mrk 2   …   Mrk p
Line 1   76      1       1       …   1
Line 2   56      1       1       …   1
Line 3   45      1       1       …   1
Line 4   67      0       1       …   0
…
Line n   22      1       1       …   1
LARGE p !!
smaller n !!
A genome-wide approach typically provides better predictions
Lorenzana and Bernardo (2009) Lorenz (2013)
[Figures comparing accuracy (rA) of MAS and GS; axis labels: Genomic rA, MAS rA]
Line Development Cycle
• New germplasm
• Make crosses and advance generations
• Genotype selection candidates
• Genomic selection
• Advance lines with highest GEBV
• Test varieties and release
Model Training Cycle
• Advance lines informative for model improvement
• Phenotype (lines have already been genotyped)
• Train prediction model
• Updated model
Modified from Heffner, Sorrells, and Jannink 2009. Crop Sci.
Genomic selection in motion
Which model?
Which lines?
Marker platform? Marker subset?
Effect of genetic distance between training population and selection candidates on prediction accuracy
Genetic distance between subpopulations
Information sharing decreases with greater genetic distance
1. Epistasis
– Genetic background-by-QTL interactions
2. Differing marker-QTL linkage phases
3. Polymorphic loci not shared
Pop 1 M------Q M------Q m------q m------q
Pop 2 m------Q m------Q M------q M------q
[Figure: principal component plot (PC 1 vs. PC 2) of lines from BuschAg, University of MN, and NDSU 6-row; 1180 polymorphic markers]
Predicting across subpopulations
[Figure: training sets vs. validation sets for Subpop 1 and Subpop 2]
Lorenz et al. (2012)
Objectives
1. Examine relationship between prediction accuracy and genetic distance between training population and selection candidates.
2. Devise a method to intelligently sample a training dataset to maximize prediction accuracy.
FHB Genomic Selection Project
Parents (Training pop): UM, N = 384; NDSU, N = 384
Crosses giving progeny (Validation pop):
• UM x UM, N = 100
• UM x ND, N = 100
• ND x ND, N = 100
Genotyping: 3072 SNPs; 384 SNPs
U.S. Wheat & Barley Scab Initiative
Kevin Smith UMN
[Figure: heat map of realized relationships Â_ij (color scale from -1 to 1.5) between the Training Population (MN, ND) and the Validation Population (MN x MN, MN x ND, ND x ND)]
Realized relationship matrix calculated with method of Endelman and Jannink (2012)
Sliding window approach
1. Order TP by average relatedness to VP.
2. Select TP of 200 lines.
3. Sliding window increments of 10.
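The sliding window steps can be sketched as follows. This is a minimal illustration that assumes the relationships between TP and VP lines are available as a matrix `A_tp_vp` (TP lines in rows, VP lines in columns); the function and argument names are hypothetical.

```python
import numpy as np

def sliding_windows(A_tp_vp, window=200, step=10):
    """Return TP index sets, most related window first (sketch).

    A_tp_vp : (n_tp, n_vp) relationships between TP and VP lines
    window  : number of TP lines per window (200 on the slide)
    step    : window increment (10 on the slide)
    """
    mean_rel = A_tp_vp.mean(axis=1)                # average relatedness to the VP
    order = np.argsort(-mean_rel, kind="stable")   # most related first
    starts = range(0, len(order) - window + 1, step)
    return [order[s:s + window] for s in starts]
```

Each returned index set would then be used to train a model and predict the VP, so accuracy can be tracked as the window moves toward less related lines.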
Can adding increasingly unrelated individuals actually hurt prediction accuracy?
“Next kin” plots
1. Rank TP individuals according to avg relationship with selection candidates
2. Select 10 most closely related individuals and predict,
3. Add next closest 10 and repeat
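The three steps above can be sketched as a loop that grows the TP ten lines at a time. This is a rough illustration: `fit_predict` is a hypothetical callable (any prediction model taking training markers, training phenotypes, and candidate markers), and `A_tp_vp` is assumed to hold the relationships between TP lines (rows) and selection candidates (columns).

```python
import numpy as np

def next_kin_curve(A_tp_vp, X_tp, y_tp, X_vp, y_vp, fit_predict, step=10):
    """Accuracy r(pred, obs) as ever less related TP lines are added (sketch)."""
    # Step 1: rank TP lines by average relationship with the candidates.
    order = np.argsort(-A_tp_vp.mean(axis=1), kind="stable")
    curve = []
    # Steps 2 and 3: start with the closest `step` lines, then keep adding.
    for size in range(step, len(order) + 1, step):
        idx = order[:size]
        pred = fit_predict(X_tp[idx], y_tp[idx], X_vp)
        r = np.corrcoef(pred, y_vp)[0, 1]
        curve.append((size, r))
    return curve
```

Plotting the returned (TP size, r) pairs gives the kind of curve the slides call a "next kin" plot.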
[Figure: r(pred, obs) and Â_ij plotted against TP size for DON, FHB, and HT. Legend classes: 100% MN, 99% - 90%, 89% - 80%, 79% - 70%, 69% - 60%, <60% MN. Adj R²: DON 0.70, FHB 0.89, HT 0.77. TP: MN+ND parents; VP: MN x MN prog.]
“Next kin” plots
[Figure: r(pred, obs) and Â_ij plotted against TP size for DON, FHB, and HT. Legend classes: 100% ND, 99% - 90%, 89% - 80%, 79% - 70%, 69% - 60%, <60% ND. Adj R²: DON 0.16, FHB 0.58, HT 0.57. TP: MN+ND parents; VP: ND x ND prog.]
Comparing TP selection schemes
[Figure: bar chart of r(pred, obs), axis 0 to 0.6, for DON, FHB, and HT under four TP selection schemes: Random, A_Mean, A_Ind Specific, A_Fam Specific. TP: MN+ND; VP: MN x MN.]
Comparing TP selection schemes
[Figure: bar chart of r(pred, obs), axis 0 to 0.5, for DON, FHB, and HT under four TP selection schemes: Random, A_Mean, A_Ind Specific, A_Fam Specific. TP: MN+ND; VP: ND x ND.]
Can we establish a standard cutoff for inclusion and exclusion of training individuals based on relatedness?
Persistence of LD phase across populations
Correlation of r
M Q M Q
M Q m q
m q m q
M Q M q
m Q m q
M q m Q
M q M q
M q m Q
m Q m Q
Persistence of LD phase across populations
Correlation of r
1. Calculate r between each pair of adjacent markers in each population.
   • m – 1 r values
2. Correlate the m – 1 r values.
   • cor(r1, r2)
3. Populations with consistent LD phases between adjacent markers will have a high “correlation of r”.
$r = \dfrac{D}{\sqrt{p_{A_1}\, p_{A_2}\, p_{B_1}\, p_{B_2}}}$
de Roos et al. (2008)
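A minimal sketch of the "correlation of r" calculation, assuming inbred lines coded 0/1 at each marker with markers in map order. Under that coding the signed r between two markers equals their Pearson correlation. The function names are illustrative, and monomorphic markers (for which r is undefined) are simply skipped.

```python
import numpy as np

def adjacent_r(H):
    """Signed LD r between each pair of adjacent markers (sketch).

    H : (n_lines, m) matrix of 0/1 alleles, markers in map order.
    Returns m - 1 values; NaN where either marker is monomorphic.
    """
    Hc = H - H.mean(axis=0)
    D = (Hc[:, :-1] * Hc[:, 1:]).mean(axis=0)   # D for each adjacent pair
    sd = Hc.std(axis=0)                          # sqrt(p * (1 - p)) per marker
    denom = sd[:-1] * sd[1:]
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, D / denom, np.nan)

def correlation_of_r(H1, H2):
    """Correlate the m - 1 adjacent-marker r values of two populations."""
    r1, r2 = adjacent_r(H1), adjacent_r(H2)
    ok = ~np.isnan(r1) & ~np.isnan(r2)           # pairs defined in both
    return np.corrcoef(r1[ok], r2[ok])[0, 1]
```

With the two-marker haplotypes from the linkage-phase slide, Pop 1 (MQ, MQ, mq, mq) gives r = 1 and Pop 2 (mQ, mQ, Mq, Mq) gives r = -1, which is the kind of phase reversal that lowers the correlation of r.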
[Figure: correlation of r (y-axis, -0.1 to 0.5) against mean Aij between TP and VP (x-axis, -0.4 to 0.4) for DON, FHB, and HT; points labeled MNxMN, NDxND, MNxND]
[Figure: Ind cor of r against TP size]
Calculate cor of r between whole VP and every individual in TP
Future work
• Determine if similar relationships exist in other species/breeding populations
• Continue to validate the “individual cor of r” criterion for designing the TP and compare it to multi-locus LD measures
Marker platform
GBS vs 92K iSelect assay Hard red winter wheat diversity panel
• Available: 299 lines sampled from winter wheat breeding programs
• Phenotyping
– Two N levels in 2012 and 2013.
– Three reps
– Mead, NE
• Genotyping
– 92K Illumina iSelect assay (Eduard Akhunov)
– Two-enzyme GBS (Jesse Poland)
• 10-fold CV replicated 100 times
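Replicated 10-fold cross validation of this kind can be sketched as below. This is an illustration under assumptions: accuracy is taken as the correlation between pooled out-of-fold predictions and observations within each replicate (the talk does not say how folds were pooled), and `ridge_fit_predict` is a stand-in model, not necessarily the one used.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Stand-in prediction model: ridge regression on centred markers."""
    mu, means = y_train.mean(), X_train.mean(axis=0)
    Xc = X_train - means
    b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(Xc.shape[1]),
                        Xc.T @ (y_train - mu))
    return mu + (X_test - means) @ b

def repeated_kfold_accuracy(X, y, fit_predict, k=10, reps=100, seed=0):
    """cor(predicted, observed) for each replicate of k-fold CV (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accuracies = []
    for _ in range(reps):
        folds = np.array_split(rng.permutation(n), k)   # new random split
        pred = np.empty(n)
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            pred[test] = fit_predict(X[train], y[train], X[test])
        accuracies.append(np.corrcoef(pred, y)[0, 1])
    return np.array(accuracies)
```

The mean of the returned array is the average prediction accuracy across replicates.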
Stephen Baenziger UNL
GBS vs 92K iSelect assay: hard red winter wheat diversity panel
Platform       #SNPs (MAF > 0.05, %NA < 0.50)
iSelect 92K    28,083
GBS            20,021
What do we do with all these markers?
Genomic prediction in soybean
• UNL soybean breeding lines
  – 301 lines
• Traits
                  Grain yld   Plant Ht   Maturity Date
  Entry-mean h2   0.78        0.79       0.97
• Genotyping-by-sequencing
  – Institute of Genomic Diversity, Cornell
  – 219,035 potential SNPs
• 10-fold cross validation replicated 200 times
George Graef, UNL
Soybean genotyping-by-sequencing
[Figure: circular plot; tracks from outside to inside: unique tag count, SNP density, MAF, percent missing]
Katie Hyma, Cornell IGD
SNP Number
Filter (MAF > 0.05)   #SNPs
%NA < 0.05            16,502
%NA < 0.80            52,349
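Filters like these can be sketched as follows, assuming genotypes coded 0/1/2 with `NaN` for missing calls and thresholds expressed as fractions. The mean imputation shown is one simple choice for filling the remaining gaps and is an assumption; it is not necessarily the imputation used in the talk.

```python
import numpy as np

def filter_snps(G, maf_min=0.05, max_na=0.80):
    """Boolean mask of SNPs passing MAF and missing-data filters (sketch).

    G : (n_lines, n_snps) genotypes coded 0/1/2, NaN = missing
    """
    missing = np.isnan(G)
    na_frac = missing.mean(axis=0)                      # fraction missing per SNP
    counts = np.maximum((~missing).sum(axis=0), 1)      # avoid division by zero
    p = np.nansum(G, axis=0) / (2 * counts)             # allele frequency
    maf = np.minimum(p, 1 - p)
    return (maf > maf_min) & (na_frac < max_na)

def mean_impute(G):
    """Replace missing calls with the SNP mean (assumed simple imputation)."""
    missing = np.isnan(G)
    counts = np.maximum((~missing).sum(axis=0), 1)
    col_means = np.nansum(G, axis=0) / counts
    return np.where(missing, col_means, G)
```

A typical use would be `G_clean = mean_impute(G[:, filter_snps(G)])`, filtering first and imputing what remains.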
Results: Average Prediction Accuracy - GY
[Figure: cor(predicted, observed) against MAF; two panels: naïve imputation and random forest imputation]
Results: Bootstrap Confidence Intervals
• 200 repetitions
• 95% CI
[Figure: cor(predicted, observed) against MAF; two panels: naïve imputation and random forest imputation]
Prediction models equal
[Figure: accuracy (0 to 1) of RR-BLUP, BayesCpi, and Bayesian LASSO for DON and FHB]
Models also equivalent in:
• Bernardo and Yu (2007) [Maize]
• Lorenzana and Bernardo (2009) [Several plant species]
• VanRaden et al. (2009) [Holstein]
• Hayes (2009) [Holstein]
Genome-wide modeling epistasis
                      % of variance
Model      rA      G      G#G    X#X    E
G          0.589   91.1   --     --     9.0
G#G        0.585   --     87.4   --     12.7
Kaa        0.585   --     --     87.4   12.6
G + G#G    0.592   65.0   25.0   --     10.0
G + Kaa    0.588   74.8   --     15.6   9.7
# = Hadamard product
G = additive realized relationship matrix
Kaa = additive-by-additive relationship matrix as shown by Xu (2013)
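The G#G kernel in the table is an element-wise (Hadamard) product of the additive relationship matrix with itself. The sketch below builds G with the common VanRaden-style formula as a stand-in; that is an assumption, since the talk computes realized relationships with the method of Endelman and Jannink (2012). The Kaa matrix of Xu (2013) is not reproduced here.

```python
import numpy as np

def additive_G(M):
    """Additive relationship matrix from 0/1/2 marker scores.

    Stand-in VanRaden-style formula, not the Endelman and Jannink method.
    M : (n_lines, n_markers)
    """
    p = M.mean(axis=0) / 2                     # allele frequencies
    W = M - 2 * p                              # centre each marker
    return W @ W.T / (2 * np.sum(p * (1 - p)))

def hadamard_kernel(G):
    """Additive-by-additive kernel G#G (element-wise product)."""
    return G * G
```

Because the Hadamard product of two positive semidefinite matrices is itself positive semidefinite, G#G can be used as a covariance structure alongside G in a model such as G + G#G.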
Conclusions
• GBS seems to work well for genomic prediction
• Use all polymorphic markers and impute
  – Don’t worry about removing markers with high %NA
• Pay special attention to the genetic distance between selection candidates and the training population.
  – Ind cor of r simultaneously factors in marker density and relationships
Thank you www.lorenzlab.net
Acknowledgements
Maize silage: Natalia de Leon, PI, UW; Renato Rodrigues, Postdoc, UW; Tim Beissinger, Student, UW
Wheat: Stephen Baenziger, PI, UNL; Ibrahim Salah, Postdoc, UNL; Jesse Poland, PI, K-State; Eduard Akhunov, PI, K-State; Mary Guttierri, Student, UNL; Katherine Frels, Student, UNL
Barley: Kevin Smith, PI, University of Minnesota; Shiaoman Chao, PI, USDA-ARS; Vikas Vikram, Student, UMN; Jean-Luc Jannink, PI, USDA-ARS
Soybean: George Graef, PI, UNL; Diego Jarquin, Postdoc, UNL; Kyle Kocak, Student, UNL; Katie Hyma, Cornell IGD; Luis Posada, Postdoc, UNL; Joey Jedlica, Student, UNL
U.S. Wheat & Barley Scab Initiative