
The effect of genetic relationships and other factors on genomic prediction accuracy in public plant breeding programs

Aaron Lorenz NGGIBCI-2014

ICRISAT, Feb 21, 2014

Jon Luetchens | Drought phenotyping | Corn breeding
Liakat Ali | Drought phenotyping | Physiology/genetics
Collin Lamkey | Heterotic groups | Corn breeding/QG
Nonoy Bandillo | Time series GWAS | GWAS methods
Amritpal Singh | Goss’s wilt | GWAS/Genome analysis
Dnyaneshwar Kadam | Hybrid prediction | Corn breeding/Genomic prediction
Ibrahim El-Basyoni | Winter wheat | Genomic prediction
Diego Jarquin | Statistics | Soybean genomic pred.

Plant breeding in the 21st century: two important trends

[Figure: cost ($) trends for genotypic data and phenotypic data]

Genomic selection

[Diagram: a training population (calibration set) with DNA marker data and phenotypic data is used for model training, $y = Xb + Zu + e$; the trained model is used to predict and select among the selection candidates.]

• No QTL mapping
• No testing for significant markers
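The training and prediction steps can be illustrated with a minimal RR-BLUP-style sketch. This is not the code used in the talk: it assumes an intercept as the only fixed effect, a shrinkage parameter `lam` fixed by hand rather than estimated from variance components, and simulated marker and phenotype data. The function names are hypothetical.

```python
import numpy as np

def rrblup_fit(Z, y, lam):
    """Ridge solution of y = 1*b + Z u + e with one shrinkage parameter.

    Z   : (n, p) marker matrix of the training population
    y   : (n,) phenotypes
    lam : assumed ratio of residual to marker-effect variance (set by hand here)
    """
    col_means = Z.mean(axis=0)
    Zc = Z - col_means                 # center marker columns
    b = y.mean()                       # intercept (the only fixed effect assumed)
    n = Z.shape[0]
    # u_hat = Zc' (Zc Zc' + lam I)^-1 (y - b): an n x n solve, cheap when p >> n
    alpha = np.linalg.solve(Zc @ Zc.T + lam * np.eye(n), y - b)
    u = Zc.T @ alpha
    return b, u, col_means

def rrblup_predict(Z_new, b, u, col_means):
    """Genomic estimated breeding values for lines with marker data only."""
    return b + (Z_new - col_means) @ u

# Simulated data (hypothetical): 60 training lines, 20 candidates, 500 markers
rng = np.random.default_rng(0)
n, p = 60, 500
Z = rng.integers(0, 2, size=(n + 20, p)).astype(float)
true_u = rng.normal(0, 0.1, p)
y = 50 + Z @ true_u + rng.normal(0, 0.5, n + 20)

b, u, mu = rrblup_fit(Z[:n], y[:n], lam=25.0)
gebv = rrblup_predict(Z[n:], b, u, mu)
accuracy = np.corrcoef(gebv, y[n:])[0, 1]   # r(pred, obs) in the candidates
```

All p marker effects are estimated at once and no marker is tested for significance, which is the point of the two bullets above.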

Estimation methods for genomic selection

1. Shrinkage models
• RR-BLUP, BayesA

2. Dimension reduction methods
• Partial least squares
• Principal component regression

3. Variable selection models
• BayesB, BayesCπ, BayesDπ

4. Kernel and machine learning methods
• Support vector machine regression

Training population

Line     Yield   Mrk 1   Mrk 2   …   Mrk p
Line 1   76      1       1       …   1
Line 2   56      1       1       …   1
Line 3   45      1       1       …   1
Line 4   67      0       1       …   0
…
Line n   22      1       1       …   1

LARGE p, smaller n!

A genome-wide approach typically provides better predictions

[Figure: accuracy (rA) of MAS compared with GS. Axis labels: Genomic rA, MAS rA; categories: MAS, GS. Sources: Lorenzana and Bernardo (2009); Lorenz (2013).]

Genomic selection in motion

[Diagram, modified from Heffner, Sorrells, and Jannink 2009. Crop Sci.]

Line Development Cycle: New germplasm; Make crosses and advance generations; Genotype selection candidates; Genomic selection; Advance lines with highest GEBV; Test varieties and release

Model Training Cycle: Advance lines informative for model improvement; Phenotype (lines have already been genotyped); Train prediction model; Updated model

Genomic selection in motion: open questions

• Which model?
• Which lines?
• Marker platform? Marker subset?

Effect of genetic distance between training population and selection candidates on prediction accuracy

Genetic distance between subpopulations

Information sharing decreases with greater genetic distance

1. Epistasis

– Genetic background-by-QTL interactions

2. Differing marker-QTL linkage phases

3. Polymorphic loci not shared

Pop 1 M------Q M------Q m------q m------q

Pop 2 m------Q m------Q M------q M------q
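The linkage-phase example above can be turned into a small worked calculation. The sketch below codes the haplotypes from the slide as 0/1 and assumes a hypothetical QTL effect of 2.0 with no noise; it shows that a marker effect estimated in Pop 1 has the wrong sign in Pop 2.

```python
import numpy as np

# Haplotypes from the slide, coded M=1, m=0 and Q=1, q=0 (columns: marker, QTL)
pop1 = np.array([[1, 1], [1, 1], [0, 0], [0, 0]])  # M-Q, M-Q, m-q, m-q
pop2 = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])  # m-Q, m-Q, M-q, M-q

qtl_effect = 2.0  # hypothetical effect of allele Q

def marker_slope(pop):
    """Regression slope of phenotype on the marker: cov(marker, y) / var(marker)."""
    marker, qtl = pop[:, 0], pop[:, 1]
    y = qtl_effect * qtl                      # noise-free phenotype driven by the QTL
    return np.cov(marker, y, bias=True)[0, 1] / marker.var()

slope1 = marker_slope(pop1)   # marker effect estimated in Pop 1
slope2 = marker_slope(pop2)   # marker effect estimated in Pop 2

# Predict Pop 2 with the marker effect trained in Pop 1
pred_pop2 = slope1 * pop2[:, 0]
true_pop2 = qtl_effect * pop2[:, 1]
r_across = np.corrcoef(pred_pop2, true_pop2)[0, 1]
```

Here `slope1` is 2.0, `slope2` is -2.0, and `r_across` is -1: with the phase fully reversed, predictions from the Pop 1 model rank Pop 2 individuals in exactly the wrong order.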

[Figure: PC 1 vs. PC 2 plot of lines from BuschAg, University of MN, and NDSU 6-row; 1180 polymorphic markers]

Predicting across subpopulations

[Figure: prediction accuracy for training sets and validation sets drawn from Subpop 1 and Subpop 2. Lorenz et al. (2012)]

Objectives

1. Examine relationship between prediction accuracy and genetic distance between training population and selection candidates.

2. Devise a method to intelligently sample a training dataset to maximize prediction accuracy.

FHB Genomic Selection Project

U.S. Wheat & Barley Scab Initiative

Kevin Smith, UMN

Parents (Training pop): UM, N = 384; NDSU, N = 384

Crosses

Progeny (Validation pop): UM x UM, N = 100; UM x ND, N = 100; ND x ND, N = 100

Genotyping: 3072 SNPs; 384 SNPs

http://www.ars.usda.gov/main/main.htm

[Figure: heat map of realized relationships $\hat{A}_{ij}$ between the training population (MN, ND) and the validation population (MN x MN, MN x ND, ND x ND); color scale from -1 to 1.5]

Realized relationship matrix calculated with method of Endelman and Jannink (2012)
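As a rough illustration of what a realized relationship matrix is, the sketch below computes the basic marker-based estimator (centered genotypes, scaled by 2Σp(1-p)). It is a simplification: it omits the shrinkage step described by Endelman and Jannink (2012), and it assumes 0/1/2 genotype coding with no missing data. The function name, the simulated genotypes, and the TP/VP split are hypothetical.

```python
import numpy as np

def realized_relationship(M):
    """Basic realized additive relationship matrix.

    M : (n, m) genotype matrix coded 0/1/2 (copies of one allele), no missing data.
    """
    p = M.mean(axis=0) / 2.0                 # allele frequency per marker
    W = M - 2.0 * p                          # center each marker column
    return (W @ W.T) / (2.0 * np.sum(p * (1.0 - p)))

rng = np.random.default_rng(1)
M = rng.integers(0, 3, size=(8, 200)).astype(float)   # simulated genotypes
A = realized_relationship(M)

tp_idx, vp_idx = np.arange(0, 5), np.arange(5, 8)     # hypothetical TP/VP split
# Average relationship of each TP line with the whole VP
mean_A_to_vp = A[np.ix_(tp_idx, vp_idx)].mean(axis=1)
```

The vector `mean_A_to_vp` is the kind of quantity a training population could be ranked by.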

Sliding window approach

1. Order TP by average relatedness to VP.
2. Select TP of 200 lines.
3. Sliding window increments of 10.
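The three steps can be sketched as an index generator. This is an illustration only: `mean_rel` stands for each TP line's average relatedness to the VP, filled here with placeholder random values, and 768 is used as the TP size on the assumption that both sets of 384 parents are pooled.

```python
import numpy as np

def sliding_windows(mean_rel, window=200, step=10):
    """Yield index arrays of TP lines, moving from most to least related to the VP."""
    order = np.argsort(-np.asarray(mean_rel))        # most related first
    for start in range(0, len(order) - window + 1, step):
        yield order[start:start + window]

rng = np.random.default_rng(2)
mean_rel = rng.normal(size=768)      # placeholder relatedness values
windows = list(sliding_windows(mean_rel))
# Each element of `windows` indexes one 200-line training set; a prediction
# model would be trained on each window and validated on the same VP.
```

With these settings the generator returns 57 windows, the first holding the 200 closest relatives and each later one shifted by 10 lines toward less related material.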

Can adding increasingly unrelated individuals actually hurt prediction accuracy?

“Next kin” plots

1. Rank TP individuals according to avg relationship with selection candidates

2. Select the 10 most closely related individuals and predict.

3. Add the next closest 10 and repeat.
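A minimal sketch of the procedure, under stated assumptions: a simple ridge predictor with an arbitrary `lam` stands in for whatever model was actually used, relatedness is approximated by an unscaled cross-product of centered genotypes, and all data are simulated. The function names are hypothetical.

```python
import numpy as np

def ridge_predict(Z_tr, y_tr, Z_new, lam=10.0):
    """Ridge (RR-BLUP-like) prediction; lam is an arbitrary placeholder value."""
    mu, col = y_tr.mean(), Z_tr.mean(axis=0)
    Zc = Z_tr - col
    alpha = np.linalg.solve(Zc @ Zc.T + lam * np.eye(len(y_tr)), y_tr - mu)
    return mu + (Z_new - col) @ (Zc.T @ alpha)

def next_kin_curve(Z_tp, y_tp, Z_vp, y_vp, mean_rel, step=10):
    """Accuracy r(pred, obs) as the TP grows from closest to most distant relatives."""
    order = np.argsort(-mean_rel)                 # closest relatives first
    curve = []
    for size in range(step, len(order) + 1, step):
        idx = order[:size]                        # nested, growing TP
        pred = ridge_predict(Z_tp[idx], y_tp[idx], Z_vp)
        curve.append((size, np.corrcoef(pred, y_vp)[0, 1]))
    return curve

# Simulated example: 60 TP lines, 30 VP lines, 100 markers
rng = np.random.default_rng(3)
Z_tp = rng.integers(0, 2, size=(60, 100)).astype(float)
Z_vp = rng.integers(0, 2, size=(30, 100)).astype(float)
effects = rng.normal(size=100)
y_tp = Z_tp @ effects + rng.normal(size=60)
y_vp = Z_vp @ effects + rng.normal(size=30)

center = np.vstack([Z_tp, Z_vp]).mean(axis=0)
mean_rel = ((Z_tp - center) @ (Z_vp - center).T).mean(axis=1)  # unscaled relatedness proxy
curve = next_kin_curve(Z_tp, y_tp, Z_vp, y_vp, mean_rel)
```

Plotting the accuracy in `curve` against TP size gives a "next kin" plot; a curve that peaks and then declines would indicate that adding distant relatives hurts.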

[Figure: “next kin” plots for DON, FHB, and HT. Axis labels: $\hat{A}_{ij}$, TP size, r(pred, obs). TP: MN+ND parents; VP: MN x MN progeny. Legend: 100% MN; 99%-90%; 89%-80%; 79%-70%; 69%-60%; <60% MN. Adj R²: DON 0.70; FHB 0.89; HT 0.77.]

“Next kin” plots

[Figure: “next kin” plots for DON, FHB, and HT. Axis labels: $\hat{A}_{ij}$, TP size, r(pred, obs). TP: MN+ND parents; VP: ND x ND progeny. Legend: 100% ND; 99%-90%; 89%-80%; 79%-70%; 69%-60%; <60% ND. Adj R²: DON 0.16; FHB 0.58; HT 0.57.]

Comparing TP selection schemes

[Figure: bar chart of r(pred, obs), axis 0 to 0.6, for DON, FHB, and HT under four TP selection schemes: Random, A_Mean, A_Ind Specific, A_Fam Specific. TP: MN+ND; VP: MN x MN.]

Comparing TP selection schemes

[Figure: bar chart of r(pred, obs), axis 0 to 0.5, for DON, FHB, and HT under four TP selection schemes: Random, A_Mean, A_Ind Specific, A_Fam Specific. TP: MN+ND; VP: ND x ND.]

Can we establish a standard cutoff for inclusion and exclusion of training individuals based on relatedness?

Persistence of LD phase across populations

[Diagram: example two-locus haplotypes (marker alleles M/m, QTL alleles Q/q) in different populations]

Correlation of r

1. Calculate r between each pair of adjacent markers in each population.
• m – 1 r values

2. Correlate the m – 1 r values.
• cor(r1, r2)

3. Populations with consistent LD phases between adjacent markers will have high “correlation of r”.
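The three steps can be sketched directly. The code assumes 0/1 allele coding (haplotypes or inbred lines), in which case the Pearson correlation between two marker columns is the signed LD measure r, and it assumes every marker is polymorphic in both populations. The data are simulated, and the second population is built by flipping the alleles at every other marker so that each adjacent pair has the opposite phase.

```python
import numpy as np

def adjacent_r(M):
    """Signed LD correlation r between each pair of adjacent markers (columns)."""
    M = np.asarray(M, dtype=float)
    return np.array([np.corrcoef(M[:, j], M[:, j + 1])[0, 1]
                     for j in range(M.shape[1] - 1)])

def correlation_of_r(M1, M2):
    """Correlate the m - 1 adjacent-marker r values of two populations."""
    return np.corrcoef(adjacent_r(M1), adjacent_r(M2))[0, 1]

rng = np.random.default_rng(4)
pop1 = rng.integers(0, 2, size=(50, 6))          # 50 lines, m = 6 markers
pop2 = pop1.copy()
pop2[:, [1, 3, 5]] = 1 - pop2[:, [1, 3, 5]]      # reverse phase of every adjacent pair

same_phase = correlation_of_r(pop1, pop1)        # identical phases
opposite_phase = correlation_of_r(pop1, pop2)    # fully reversed phases
```

The two extremes give a correlation of r of 1 and -1; real pairs of populations fall in between, with higher values indicating more persistent LD phase.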

$r = \dfrac{D}{\sqrt{p_{A_1} p_{A_2} p_{B_1} p_{B_2}}}$

de Roos et al. (2008)

[Figure: correlation of r (-0.1 to 0.5) plotted against mean Aij between TP and VP (-0.4 to 0.4) for MNxMN, NDxND, and MNxND]

Ind cor of r

Calculate cor of r between whole VP and every individual in TP

[Figure: panels for DON, FHB, and HT; axis label: TP size]

Future work

• Determine if similar relationships exist in other species/breeding populations

• Continue to validate the “individual cor of r” criterion for designing TP and compare it to multi-locus LD measures

Marker platform

GBS vs 92K iSelect assay: hard red winter wheat diversity panel

• Available: 299 lines sampled from winter wheat breeding programs

• Phenotyping

– Two N levels in 2012 and 2013.

– Three reps

– Mead, NE

• Genotyping

– 92K Illumina iSelect assay (Eduard Akhunov)

– Two-enzyme GBS (Jesse Poland)

• 10-fold CV replicated 100 times
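The cross-validation scheme can be sketched as a generator of train/test splits. This is an illustration, not the authors' code: the function name and seed are hypothetical, and the model fitted within each fold is left out.

```python
import numpy as np

def replicated_kfold(n, k=10, reps=100, seed=0):
    """Yield (rep, fold, train_idx, test_idx) for k-fold CV repeated `reps` times."""
    rng = np.random.default_rng(seed)
    for rep in range(reps):
        folds = np.array_split(rng.permutation(n), k)   # new random partition per replicate
        for f, test_idx in enumerate(folds):
            train_idx = np.concatenate([folds[g] for g in range(k) if g != f])
            yield rep, f, train_idx, test_idx

# 299 lines, 10 folds, 100 replicates, as on the slide
splits = list(replicated_kfold(n=299, k=10, reps=100))
```

Each replicate reshuffles the 299 lines, so every line is predicted exactly once per replicate and accuracy can be averaged over the 100 replicates.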

Stephen Baenziger UNL

GBS vs 92K iSelect assay: hard red winter wheat diversity panel

#SNPs (MAF > 0.05, %NA < 0.50)

Platform      #SNPs
iSelect 92K   28,083
GBS           20,021

What do we do with all these markers?

Genomic prediction in soybean

• UNL soybean breeding lines
– 301 lines

• Traits

                Grain yld   Plant Ht   Maturity Date
Entry-mean h²   0.78        0.79       0.97

• Genotyping-by-sequencing
– Institute for Genomic Diversity, Cornell
– 219,035 potential SNPs

• 10-fold cross validation replicated 200 times

George Graef, UNL

Soybean genotyping-by-sequencing

[Figure: concentric tracks, outside to inside: unique tag count, SNP density, MAF, percent missing. Katie Hyma, Cornell IGD]

SNP number

Filter (MAF > 0.05)   #SNPs
%NA < 0.05            16,502
%NA < 0.80            52,349
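A filter of this kind can be sketched as follows. The sketch assumes genotypes coded 0/1/2 with `np.nan` for missing calls and treats the %NA thresholds as proportions; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def filter_snps(G, maf_min=0.05, na_max=0.80):
    """Boolean mask of SNPs with MAF > maf_min and missing fraction < na_max.

    G : (lines, SNPs) matrix coded 0/1/2, with np.nan for missing calls.
    """
    na_frac = np.isnan(G).mean(axis=0)                 # fraction missing per SNP
    called = (~np.isnan(G)).sum(axis=0)                # number of non-missing calls
    allele_sum = np.nansum(G, axis=0)
    freq = np.divide(allele_sum, 2.0 * called,
                     out=np.zeros(G.shape[1]), where=called > 0)
    maf = np.minimum(freq, 1.0 - freq)                 # minor allele frequency
    return (maf > maf_min) & (na_frac < na_max)

G = np.array([[0, 2, np.nan, 0],
              [2, 2, np.nan, 0],
              [0, 2, np.nan, np.nan],
              [2, 2, 2,      0],
              [0, 2, 0,      2]], dtype=float)

lenient = filter_snps(G, na_max=0.80)   # keeps the SNP with 60% missing calls
strict = filter_snps(G, na_max=0.50)    # drops it
```

In the toy matrix the second SNP is monomorphic and fails the MAF filter under both settings, while the third SNP (60% missing) passes only the lenient threshold, mirroring how the SNP count grows as the %NA cutoff is relaxed.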

Results: Average Prediction Accuracy - GY

[Figure: cor(predicted, observed) vs. MAF; two panels: naïve imputation and random forest imputation]

Results: Bootstrap Confidence Intervals

• 200 repetitions
• 95% CI
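One common way to obtain such an interval is the percentile bootstrap, sketched below. How the intervals in the talk were actually computed is not stated, so the resampling unit (lines), the percentile method, and the simulated predictions are all assumptions.

```python
import numpy as np

def bootstrap_ci(pred, obs, reps=200, level=0.95, seed=0):
    """Percentile bootstrap CI for cor(predicted, observed)."""
    rng = np.random.default_rng(seed)
    n = len(pred)
    stats = np.empty(reps)
    for i in range(reps):
        idx = rng.integers(0, n, size=n)               # resample lines with replacement
        stats[i] = np.corrcoef(pred[idx], obs[idx])[0, 1]
    tail = (1.0 - level) / 2.0
    return np.quantile(stats, [tail, 1.0 - tail])      # lower and upper bounds

rng = np.random.default_rng(5)
obs = rng.normal(size=100)                 # simulated observed values
pred = obs + rng.normal(size=100)          # simulated predictions
lower, upper = bootstrap_ci(pred, obs)
```

Two filtering or imputation settings whose intervals overlap heavily would be hard to distinguish in accuracy.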

[Figure: cor(predicted, observed) vs. MAF with bootstrap confidence intervals; two panels: naïve imputation and random forest imputation]

Prediction models equal

[Figure: bar chart of accuracy (0 to 1) for DON and FHB with RR-BLUP, BayesCπ, and Bayesian LASSO]

Models also equivalent in:
• Bernardo and Yu (2007) [Maize]
• Lorenzana and Bernardo (2009) [Several plant species]
• VanRaden et al. (2009) [Holstein]
• Hayes (2009) [Holstein]

Genome-wide modeling epistasis

                   % of variance
Models    rA      G      G#G    X#X    E
G         0.589   91.1   --     --     9.0
G#G       0.585   --     87.4   --     12.7
Kaa       0.585   --     --     87.4   12.6
G + G#G   0.592   65.0   25.0   --     10.0
G + Kaa   0.588   74.8   --     15.6   9.7

# = Hadamard product
G = additive realized relationship matrix
Kaa = additive-by-additive relationship matrix as shown by Xu (2013)
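The G#G kernel is the element-wise (Hadamard) product of G with itself. The sketch below builds it from a simulated G and combines the kernels into a phenotypic covariance using the "G + G#G" row of the table as illustrative proportions. It does not attempt to reproduce Kaa from Xu (2013), and the simulated marker matrix is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.normal(size=(6, 50))            # simulated centered marker matrix, 6 lines
G = W @ W.T / 50                        # stand-in additive relationship matrix

G_gg = G * G                            # Hadamard product G#G (element-wise, not G @ G)

# Illustrative covariance of phenotypes under the G + G#G model, using the
# variance percentages from the table (65.0, 25.0, 10.0) as proportions
V = 0.65 * G + 0.25 * G_gg + 0.10 * np.eye(6)
```

Because G is positive semidefinite, G#G is as well (Schur product theorem), so it can be used as a covariance kernel in the same mixed-model machinery as G.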

Conclusions

• GBS seems to work well for genomic prediction

• Use all polymorphic markers and impute
– Don’t worry about removing markers with high %NA

• Pay special attention to the genetic distance between selection candidates and the training population.
– Ind cor of r simultaneously factors in marker density and relationships

Thank you www.lorenzlab.net

Acknowledgements

Maize silage
Natalia de Leon, PI, UW
Renato Rodrigues, Postdoc, UW
Tim Beissinger, Student, UW

Wheat
Stephen Baenziger, PI, UNL
Ibrahim Salah, Postdoc, UNL
Jesse Poland, PI, K-State
Eduard Akhunov, PI, K-State
Mary Guttierri, Student, UNL
Katherine Frels, Student, UNL

Barley
Kevin Smith, PI, University of Minnesota
Shiaoman Chao, PI, USDA-ARS
Vikas Vikram, Student, UMN
Jean-Luc Jannink, PI, USDA-ARS

Soybean
George Graef, PI, UNL
Diego Jarquin, Postdoc, UNL
Kyle Kocak, Student, UNL
Katie Hyma, Cornell IGD
Luis Posada, Postdoc, UNL
Joey Jedlica, Student, UNL

U.S. Wheat & Barley Scab Initiative
