A tour of solving a Ridge regression model Gregor Gorjanc, John M. Hickey www.alphagenes.roslin.ed.ac.uk @GregorGorjanc


Post on 24-Sep-2020


Page 1:

A tour of solving a Ridge regression model

Gregor Gorjanc, John M. Hickey

www.alphagenes.roslin.ed.ac.uk @GregorGorjanc

Page 2:

The plan

Data example

Direct/exact solve

Iterative solve via Gauss-Seidel

Monte Carlo Markov Chain

Page 3:

A small example

Locus

Individual 1 2 3 4

A A/A B/B A/B A/A

B A/A B/B A/A A/A

C A/B B/B B/B B/B

D B/B A/B A/A A/A

Page 4:

Allele dosages

Locus

Individual 1 2 3 4

A 0 2 1 0

B 0 2 0 0

C 1 2 2 2

D 2 1 0 0
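The step from genotypes to allele dosages can be sketched in R as below. This is a minimal sketch rather than the authors' code: the names Genotype, Geno and CountB are assumptions made for illustration, and the dosage is taken to be the number of B alleles, which is what the table above shows.

```r
# Genotypes from the small example (rows = individuals A-D, columns = loci 1-4)
Genotype <- matrix(c("A/A", "B/B", "A/B", "A/A",
                     "A/A", "B/B", "A/A", "A/A",
                     "A/B", "B/B", "B/B", "B/B",
                     "B/B", "A/B", "A/A", "A/A"),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c("A", "B", "C", "D"), paste0("Locus", 1:4)))

# Hypothetical helper: count the B alleles in one genotype string such as "A/B"
CountB <- function(g) {
  sum(strsplit(g, split = "/", fixed = TRUE)[[1]] == "B")
}

# Apply the helper to every cell; the result keeps the row and column names
Geno <- apply(Genotype, c(1, 2), CountB)
print(Geno)
```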

Page 5:

Genes and markers

Let's pick locus 1 as a gene (= causal locus) and loci 2 and 3 as markers

Locus

Individual 1 2 3 4

A 0 2 1 0

B 0 2 0 0

C 1 2 2 2

D 2 1 0 0

Page 6:

Simulate phenotypes

Quantitative genetic model

P = Mean + G + E + G×E
  = Mean + (A + D + I) + E + (…)×E

P ≈ Mean + A + E

Page 7:

Simulate phenotypes

•  Population mean 10 units (mean of reference genotype, A/A)

•  Allele substitution effect 2 units (the change in mean when allele A is replaced by allele B)

•  Breeding value = Allele sub. effect * Allele dosage

•  True phenotype = Pop. mean + Breeding value

Page 8:

Simulate phenotypes

Individual  Gene  Population mean  Breeding value  True phenotype
A           0     10               0×2=0           10
B           0     10               0×2=0           10
C           1     10               1×2=2           12
D           2     10               2×2=4           14
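The table above can be reproduced with a few lines of R. This is a minimal sketch; the object names (Gene, PopMean, AlleleSubEffect, BreedingValue, TruePhen) are assumptions chosen for illustration and need not match the workshop scripts.

```r
# Dosage at the causal locus (locus 1) for individuals A-D
Gene <- c(A = 0, B = 0, C = 1, D = 2)

PopMean <- 10          # mean of the reference genotype A/A
AlleleSubEffect <- 2   # allele substitution effect

# Breeding value = allele substitution effect * allele dosage
BreedingValue <- AlleleSubEffect * Gene

# True phenotype = population mean + breeding value
TruePhen <- PopMean + BreedingValue

print(data.frame(Gene, PopMean, BreedingValue, TruePhen))
```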

Page 9:

Simulate phenotype

•  Observed phenotype = True phenotype + Noise

•  Sample noise from a Gaussian distribution: Noise ~ Normal(0, Ve)

•  How much noise?

•  Target h² of 0.3, where h² = Va/(Va+Ve)

•  Work out Ve if Va = 3.67 units²

Page 10:

Simulate phenotype

•  Observed phenotype = True phenotype + Noise

•  Sample noise from a Gaussian distribution: Noise ~ Normal(0, Ve)

•  How much noise?

•  Target h² of 0.3, where h² = Va/(Va+Ve)

•  Work out Ve if Va = 3.67 units²

Ve = Va(1-h²)/h² = 3.67(1-0.3)/0.3 = 8.56 units²
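The arithmetic can be checked in R. In this sketch Va is computed as the sample variance of the four breeding values (0, 0, 2, 4), which gives the 3.67 quoted on the slide; the object names and the random seed are assumptions, so the sampled noise will not match the values used in the presentation.

```r
# Additive genetic variance: sample variance of the true breeding values
BreedingValue <- c(0, 0, 2, 4)
Va <- var(BreedingValue)          # 11/3, about 3.67

# Rearranging h2 = Va / (Va + Ve) for a target heritability of 0.3
h2 <- 0.3
Ve <- Va * (1 - h2) / h2          # about 8.56

# Sample noise from Normal(0, Ve); rnorm() expects a standard deviation
set.seed(1)                       # arbitrary seed, an assumption
Noise <- rnorm(n = length(BreedingValue), mean = 0, sd = sqrt(Ve))
ObsPhen <- 10 + BreedingValue + Noise

print(round(c(Va = Va, Ve = Ve), 2))
```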

Page 11:

Simulate phenotypes

Individual  Gene  True phenotype  Noise  Observed phenotype
A           0     10               2.3   12.3
B           0     10              -5.9    4.1
C           1     12               3.6   15.6
D           2     14               5.0   19.0

We will work with the true phenotype so that we all get the same solutions

Page 12:

R

•  File 01_Data.R

•  Run the code (step by step)

•  Which marker should capture most of the gene's effect?

•  Will the estimated marker effect be positive or negative?

Page 13:

R

•  Marker 2 has the strongest correlation with the gene

•  Gene effect is positive, but correlation is negative, so marker effect estimate will likely be negative.
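Both statements can be checked from the dosage table. The sketch below uses assumed object names (Gene, Locus2, Locus3) and takes "marker 2" to mean locus 2, which is the marker with the strongest correlation.

```r
# Dosages for individuals A-D
Gene   <- c(0, 0, 1, 2)   # locus 1, the causal locus
Locus2 <- c(2, 2, 2, 1)   # first marker
Locus3 <- c(1, 0, 2, 0)   # second marker

# Correlation of each marker with the gene
CorLocus2 <- cor(Gene, Locus2)   # about -0.87
CorLocus3 <- cor(Gene, Locus3)   # about -0.09

print(round(c(Locus2 = CorLocus2, Locus3 = CorLocus3), 2))
```

The correlation with locus 2 is the larger one in absolute value and it is negative, so a positive gene effect is expected to show up as a negative marker effect.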

Page 14:

Summary

Data example

Direct/exact solve

Iterative solve via Gauss-Seidel

Monte Carlo Markov Chain

Page 15:

The tasks

1)  Set up the model

2)  Estimate the model parameters (=solve the system of equations)

3)  Estimate/predict (genomic) breeding values

4)  Evaluate accuracy of breeding values (in the training set!!!)

Page 16:

Using R’s lm() function

•  File 02_Estimate_lm.R

•  Use R functions

NOTE: this is not a ridge regression model – just a linear model without any shrinkage/penalization

Page 17:

Using R’s lm() function

> summary(LmFit)

Call:
lm(formula = Phen ~ 1 + Geno[, Cols])

Residuals:
         1          2          3          4
-6.667e-01  3.333e-01  3.333e-01  2.776e-17

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    18.3333     1.7951  10.213   0.0621 .
Geno[, Cols]1  -4.3333     1.1055  -3.920   0.1590
Geno[, Cols]2   1.0000     0.5774   1.732   0.3333
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8165 on 1 degrees of freedom
Multiple R-squared:  0.9394, Adjusted R-squared:  0.8182
F-statistic:  7.75 on 2 and 1 DF,  p-value: 0.2462
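The fit can be reproduced from the example data. The script 02_Estimate_lm.R is not part of this transcript, so the sketch below is an assumption about how it might look: the names Phen, Geno, Cols and LmFit are taken from the printed output, while the way the objects are built here is illustrative only.

```r
# True phenotypes of individuals A-D (the noise-free values are used throughout)
Phen <- c(10, 10, 12, 14)

# Allele dosages (rows = individuals A-D, columns = loci 1-4)
Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0), nrow = 4, byrow = TRUE)

# Loci 2 and 3 are the markers
Cols <- c(2, 3)

# Ordinary least squares: intercept plus two marker effects, no shrinkage
LmFit <- lm(Phen ~ 1 + Geno[, Cols])
print(summary(LmFit))
```

With only four records and three parameters there is a single residual degree of freedom, which is why the standard errors are large relative to the estimates.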

Page 18:

Do it yourself

•  File 03_Estimate_Direct_Solve.R

•  Model
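The equation shown on the slide is not preserved in this transcript. A model consistent with the lm() call (intercept plus two marker effects) can be written as below; the notation is an assumption, with y the phenotypes, M the matrix of marker dosages (loci 2 and 3), b the marker effects and e the residuals.

```latex
\[
\mathbf{y} = \mathbf{1}\mu + \mathbf{M}\mathbf{b} + \mathbf{e}
           = \mathbf{X}\boldsymbol{\beta} + \mathbf{e},
\qquad
\mathbf{X} = [\,\mathbf{1}\;\;\mathbf{M}\,],
\qquad
\boldsymbol{\beta} = (\mu,\, b_1,\, b_2)^{\top},
\qquad
\mathbf{e} \sim N(\mathbf{0},\, \mathbf{I}\sigma^2_e)
\]
```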


Page 20:

Do it yourself

•  System of equations
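The system itself is missing from the transcript. For a linear model without shrinkage it is the set of normal equations, written here in assumed notation (X holds a column of ones and the two marker columns, y holds the true phenotypes 10, 10, 12, 14), followed by the numbers for this example.

```latex
\[
\mathbf{X}^{\top}\mathbf{X}\,\hat{\boldsymbol{\beta}} = \mathbf{X}^{\top}\mathbf{y}
\qquad\Longrightarrow\qquad
\begin{bmatrix}
4 & 7 & 3 \\
7 & 13 & 6 \\
3 & 6 & 5
\end{bmatrix}
\begin{bmatrix}
\hat{\mu} \\ \hat{b}_1 \\ \hat{b}_2
\end{bmatrix}
=
\begin{bmatrix}
46 \\ 78 \\ 34
\end{bmatrix}
\]
```

Solving gives (55/3, -13/3, 1), which agrees with the lm() estimates 18.3333, -4.3333 and 1.0000.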

Page 21:

Do it yourself

•  Solve the system

•  Predict phenotype

•  Standard errors

Page 22:

Do it yourself

•  Solutions

•  Standard errors

•  Predictions

•  Accuracy (in training!!!!!!!!!!)
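The four steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the contents of 03_Estimate_Direct_Solve.R: all object names are illustrative, the residual variance is estimated from the residual sum of squares, and accuracy is taken as the correlation between predicted and true breeding values in the training set.

```r
Phen <- c(10, 10, 12, 14)                        # true phenotypes, individuals A-D
M <- cbind(c(2, 2, 2, 1), c(1, 0, 2, 0))         # marker dosages (loci 2 and 3)
X <- cbind(1, M)                                 # design matrix: intercept + markers
TrueBV <- c(0, 0, 2, 4)                          # 2 * dosage at the causal locus

# System of equations: X'X b = X'y
LHS <- crossprod(X)
RHS <- crossprod(X, Phen)

# Solutions
Sol <- as.vector(solve(LHS, RHS))

# Predictions and residual variance (one residual degree of freedom here)
Pred <- as.vector(X %*% Sol)
VeHat <- sum((Phen - Pred)^2) / (nrow(X) - ncol(X))

# Standard errors from the inverse of the left-hand side
SE <- sqrt(diag(solve(LHS)) * VeHat)

# Accuracy in the training set: predicted vs. true breeding values
PredBV <- as.vector(M %*% Sol[-1])
Acc <- cor(PredBV, TrueBV)

print(round(cbind(Sol, SE), 4))
print(round(Acc, 3))
```

The solutions and standard errors should match the lm() summary, and the squared accuracy equals the reported multiple R-squared because the phenotypes used here are noise-free.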

Page 23:

Ridge regression

•  File 04_Estimate_Direct_Solve_Prior.R

•  Assume that we know variance components
   – Vm = Va/nMarkers = 3.67/2 = 1.83
   – Ve = 8.56

Page 24:

Ridge regression - system
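The system shown on the slide is not preserved. With known variance components, the standard ridge regression (mixed model) equations add the ratio Ve/Vm to the diagonal of the marker block only, leaving the intercept unpenalised. The notation below is an assumption: M is the marker dosage matrix, b the marker effects and λ the variance ratio.

```latex
\[
\begin{bmatrix}
\mathbf{1}^{\top}\mathbf{1} & \mathbf{1}^{\top}\mathbf{M} \\
\mathbf{M}^{\top}\mathbf{1} & \mathbf{M}^{\top}\mathbf{M} + \mathbf{I}\lambda
\end{bmatrix}
\begin{bmatrix}
\hat{\mu} \\ \hat{\mathbf{b}}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{1}^{\top}\mathbf{y} \\ \mathbf{M}^{\top}\mathbf{y}
\end{bmatrix},
\qquad
\lambda = \frac{V_e}{V_m} = \frac{8.56}{1.83} \approx 4.68
\]
```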

Page 25:

Results

•  Solutions

•  Standard errors

•  Predictions
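A direct solve of the ridge system can be sketched as below. This is an illustration under assumptions, not the contents of 04_Estimate_Direct_Solve_Prior.R: object names are made up, the variance components are the ones quoted earlier (Va = 3.67, Ve = 8.56), and the intercept is left unpenalised.

```r
Phen <- c(10, 10, 12, 14)                    # true phenotypes, individuals A-D
M <- cbind(c(2, 2, 2, 1), c(1, 0, 2, 0))     # marker dosages (loci 2 and 3)
X <- cbind(1, M)

# Known variance components
Va <- 3.67
Vm <- Va / ncol(M)                           # variance per marker
Ve <- 8.56
Lambda <- Ve / Vm                            # shrinkage parameter

# Ridge system: add Lambda to the marker diagonals only (not the intercept)
LHS <- crossprod(X) + diag(c(0, Lambda, Lambda))
RHS <- crossprod(X, Phen)

# Solutions, standard errors (variances known, so Ve is used directly), predictions
Sol <- as.vector(solve(LHS, RHS))
SE <- sqrt(diag(solve(LHS)) * Ve)
Pred <- as.vector(X %*% Sol)

print(round(cbind(Sol, SE), 3))
print(round(Pred, 2))
```

Compared with the least squares estimate of about -4.33, the first marker effect is pulled strongly towards zero, which is the shrinkage that the prior variance Vm imposes.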

Page 26:

Summary

Data example

Direct/exact solve

Iterative solve via Gauss-Seidel

Monte Carlo Markov Chain

Page 27:

Direct vs. iterative methods

•  Direct methods
   – PRO: get estimates (= cond. means) and variance of estimates (= cond. variances)
   – CON: can be expensive to solve for big datasets

•  Iterative methods
   – PRO: can be solved for VERY large systems
   – CON: get only estimates

(NOTE: can easily extend to get variance of estimates as well as other stuff → full Bayesian analysis via MCMC)

Page 28:

Gauss-Seidel with residual update

1)  Set up diagonal of X'X
2)  Define working vector
3)  Initialize solutions
4)  Iterate until convergence
    –  Iterate over parameters
       1)  Add to working vector
       2)  Set up LHS diagonal element
       3)  Set up RHS element
       4)  Estimate
       5)  Remove from working vector

Page 29:

GSRU in R

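# X, Phen, nCov and nIter are defined earlier in the script
XpX <- colSums(X*X)             # diagonal of X'X
E <- Phen                       # working vector (residuals, since Sol starts at 0)
Sol <- rep(0, times=nCov)       # initial solutions
Iter <- 2                       # starts at 2, presumably because iteration 1 holds the starting values
while (Iter <= nIter) {
  Eps <- 0
  CovOrder <- sample(x=1:nCov)  # random order of parameters
  for(j in CovOrder) {
    E <- E + X[, j]*Sol[j]      # add to working vector
    LHS <- XpX[j]               # LHS diagonal element
    RHS <- sum(X[, j]*E)        # RHS element
    New <- RHS/LHS              # estimate
    E <- E - X[, j]*New         # remove from working vector
    Eps <- Eps + abs(New - Sol[j])
    Sol[j] <- New
  }
  Iter <- Iter + 1
  if (Eps < 1e-8) break         # converged: solutions barely changed
}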

•  File 05_Estimate_GSRU.R

Page 30:

Convergence

Page 31:

GSRU for ridge regression

1)  Set up diagonal of X'X
2)  Define working vector
3)  Initialize solutions
4)  Iterate until convergence
    –  Iterate over parameters
       1)  Add to working vector
       2)  Set up LHS diagonal element
       3)  Set up RHS element
       4)  Estimate
       5)  Remove from working vector

File 06_Estimate_GSRU_Prior.R
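The steps above can be sketched for the ridge case as follows. This is an illustration, not the contents of 06_Estimate_GSRU_Prior.R: the only change assumed relative to plain GSRU is that Ve/Vm is added to the LHS diagonal element of each marker effect (and nothing to the intercept). The Lambda vector, the seed and the comparison with a direct solve are assumptions added for checking.

```r
set.seed(1)                                  # arbitrary seed for the random update order
Phen <- c(10, 10, 12, 14)                    # true phenotypes, individuals A-D
X <- cbind(1, c(2, 2, 2, 1), c(1, 0, 2, 0))  # intercept + markers (loci 2 and 3)
nCov <- ncol(X)

Ve <- 8.56
Vm <- 3.67 / 2
Lambda <- c(0, rep(Ve / Vm, nCov - 1))       # no penalty on the intercept

XpX <- colSums(X * X)                        # 1) diagonal of X'X
E <- Phen                                    # 2) working vector (residuals)
Sol <- rep(0, times = nCov)                  # 3) initial solutions
nIter <- 10000

for (Iter in seq_len(nIter)) {               # 4) iterate until convergence
  Eps <- 0
  for (j in sample(seq_len(nCov))) {
    E <- E + X[, j] * Sol[j]                 # add to working vector
    LHS <- XpX[j] + Lambda[j]                # LHS diagonal element, with shrinkage
    RHS <- sum(X[, j] * E)                   # RHS element
    New <- RHS / LHS                         # estimate
    E <- E - X[, j] * New                    # remove from working vector
    Eps <- Eps + abs(New - Sol[j])
    Sol[j] <- New
  }
  if (Eps < 1e-8) break
}

# Check against the direct solve of the same system
Direct <- as.vector(solve(crossprod(X) + diag(Lambda), crossprod(X, Phen)))
print(round(rbind(GSRU = Sol, Direct = Direct), 4))
```

Only vectors of length n and nCov are held in memory, and X'X is never formed in full, which is why this scheme scales to very large systems.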

Page 32:

Convergence

Shrinkage in action!!! (also called regularization or penalization)

Page 33:

Summary

Data example

Direct/exact solve

Iterative solve via Gauss-Seidel

Monte Carlo Markov Chain

Page 34:

Monte Carlo Markov Chain

•  A method to obtain numerical approximations of multi-dimensional integrals
   – Monte Carlo = sampling from distributions
   – Markov Chain = the current value depends only on the previous value, not on any earlier values

•  Very useful tool in Bayesian analysis
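As a small illustration of the Monte Carlo part only (plain independent sampling, no Markov chain yet), the sketch below approximates a two-dimensional integral: the probability that a pair of independent standard normal values falls inside the unit circle. The example, the sample size and the seed are assumptions chosen for illustration; the exact value, 1 - exp(-1/2), is known here, so the approximation can be checked.

```r
set.seed(1)                       # arbitrary seed
nSamples <- 100000

# Sample from the two-dimensional distribution
x1 <- rnorm(nSamples)
x2 <- rnorm(nSamples)

# The integral of the density over the unit circle is the expected value
# of an indicator, so it is approximated by a sample proportion
McEstimate <- mean(x1^2 + x2^2 < 1)

# Exact value: x1^2 + x2^2 follows a chi-square distribution with 2 df
Exact <- 1 - exp(-0.5)

print(c(MonteCarlo = McEstimate, Exact = Exact))
```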

Page 35:

MCMC for ridge regression

•  Not needed in this case!!!

•  Model (means unknown, variances known)

•  Posterior (we know how to solve this)
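The equations on this slide are not preserved. Under the usual assumptions (flat prior on the intercept, marker effects b ~ N(0, I Vm), residuals e ~ N(0, I Ve), both variances known), the posterior is multivariate normal and is fully described by the ridge system. The notation below is an assumption.

```latex
\[
\boldsymbol{\beta} \mid \mathbf{y}, V_e, V_m
\;\sim\;
N\!\left(\hat{\boldsymbol{\beta}},\; \mathbf{C}^{-1} V_e\right),
\qquad
\mathbf{C} = \mathbf{X}^{\top}\mathbf{X} + \operatorname{diag}(0, \lambda, \lambda),
\qquad
\hat{\boldsymbol{\beta}} = \mathbf{C}^{-1}\mathbf{X}^{\top}\mathbf{y},
\qquad
\lambda = \frac{V_e}{V_m}
\]
```

The posterior mean is the direct ridge solution and the posterior standard deviations are the direct-solve standard errors, which is why sampling is not needed here.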

Page 36:

MCMC for ridge regression

•  File 07_Estimate_MCMC.R

•  Let's test if we will get the same standard errors as with the direct solve

•  It's easy to implement on top of GSRU
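A sketch of how sampling can be layered on top of GSRU: instead of setting each effect to RHS/LHS, a value is drawn from a normal distribution with that mean and variance Ve/LHS (a single-site Gibbs sampler). This is an illustration under assumptions, not the contents of 07_Estimate_MCMC.R; chain length, burn-in, seed and object names are all made up.

```r
set.seed(1)                                  # arbitrary seed
Phen <- c(10, 10, 12, 14)                    # true phenotypes, individuals A-D
X <- cbind(1, c(2, 2, 2, 1), c(1, 0, 2, 0))  # intercept + markers (loci 2 and 3)
nCov <- ncol(X)
Ve <- 8.56
Vm <- 3.67 / 2
Lambda <- c(0, rep(Ve / Vm, nCov - 1))       # no penalty on the intercept

XpX <- colSums(X * X)
E <- Phen
Sol <- rep(0, times = nCov)
nIter <- 20000
nBurn <- 1000
Samples <- matrix(NA_real_, nrow = nIter, ncol = nCov)

for (Iter in seq_len(nIter)) {
  for (j in seq_len(nCov)) {
    E <- E + X[, j] * Sol[j]                 # add to working vector
    LHS <- XpX[j] + Lambda[j]
    RHS <- sum(X[, j] * E)
    # Sample from the full conditional instead of taking its mean
    Sol[j] <- rnorm(n = 1, mean = RHS / LHS, sd = sqrt(Ve / LHS))
    E <- E - X[, j] * Sol[j]                 # remove from working vector
  }
  Samples[Iter, ] <- Sol
}

# Summarize the chain after discarding the burn-in
Keep <- Samples[-seq_len(nBurn), ]
PostMean <- colMeans(Keep)
PostSD <- apply(Keep, 2, sd)

# Direct solve for comparison
C <- crossprod(X) + diag(Lambda)
DirectSol <- as.vector(solve(C, crossprod(X, Phen)))
DirectSE <- sqrt(diag(solve(C)) * Ve)

print(round(rbind(PostMean, DirectSol, PostSD, DirectSE), 2))
```

Up to Monte Carlo error, the posterior means and standard deviations should agree with the direct-solve solutions and standard errors.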

Page 37:

Traceplot

Page 38:

Traceplot

Page 39:

Summarize the chains

•  Posterior means

•  Posterior standard deviations

Page 40:

MCMC for FULL ridge regression

•  Needed in this case!!!

•  Model (means and variances unknown)

•  Posterior (tricky …)

Page 41:

MCMC for FULL ridge regression

•  File 08_Estimate_MCMC_Including_Variances.R

•  As before, but now also sample the variance components in addition to the means (intercept and two marker effects)
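One way to extend the sampler is sketched below. The priors used in 08_Estimate_MCMC_Including_Variances.R are not given in the transcript, so everything about them here is an assumption: scaled inverse chi-square priors with 4 degrees of freedom and scales equal to the variances used earlier (8.56 and 1.83). With different priors the summaries will differ, so the output is not expected to match the table in the presentation.

```r
set.seed(1)                                  # arbitrary seed
Phen <- c(10, 10, 12, 14)                    # true phenotypes, individuals A-D
X <- cbind(1, c(2, 2, 2, 1), c(1, 0, 2, 0))  # intercept + markers (loci 2 and 3)
n <- nrow(X)
nCov <- ncol(X)
nMarkers <- nCov - 1

# ASSUMED priors: scaled inverse chi-square (degrees of freedom, scale)
DfE <- 4; ScaleE <- 8.56
DfM <- 4; ScaleM <- 1.83

Ve <- ScaleE                                 # starting values
Vm <- ScaleM
XpX <- colSums(X * X)
E <- Phen
Sol <- rep(0, times = nCov)
nIter <- 20000
nBurn <- 1000
Samples <- matrix(NA_real_, nrow = nIter, ncol = nCov + 2,
                  dimnames = list(NULL, c("Intercept", "b1", "b2", "Ve", "Vm")))

for (Iter in seq_len(nIter)) {
  Lambda <- c(0, rep(Ve / Vm, nMarkers))     # shrinkage changes with the sampled variances
  for (j in seq_len(nCov)) {
    E <- E + X[, j] * Sol[j]
    LHS <- XpX[j] + Lambda[j]
    RHS <- sum(X[, j] * E)
    Sol[j] <- rnorm(n = 1, mean = RHS / LHS, sd = sqrt(Ve / LHS))
    E <- E - X[, j] * Sol[j]
  }
  # Sample variance components from their full conditionals
  Vm <- (sum(Sol[-1]^2) + DfM * ScaleM) / rchisq(n = 1, df = nMarkers + DfM)
  Ve <- (sum(E^2) + DfE * ScaleE) / rchisq(n = 1, df = n + DfE)
  Samples[Iter, ] <- c(Sol, Ve, Vm)
}

# Posterior means and standard deviations after burn-in
Keep <- Samples[-seq_len(nBurn), ]
print(round(cbind(Mean = colMeans(Keep), SD = apply(Keep, 2, sd)), 2))
```

With only four records the posterior of the variance components is dominated by the prior, so the choice of degrees of freedom and scale matters a great deal in this toy example.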

Page 42:

Traceplot 1

Page 43:

Traceplot 2

Page 44:

Summarize the chains

Parameter  Mean  SD      Parameter  Mean  SD
Ve         3.5   2.8     Intercept  14.4  3.0
Vm         4.4   4.4     b1         -1.8  1.7
Va         6.7   6.7     b2          0.3  1.0
h2         0.63  0.19

Page 45:

Summary

Data example

Direct/exact solve

Iterative solve via Gauss-Seidel

Monte Carlo Markov Chain

Page 46:

Acknowledgements

•  Roslin
   –  Gregor Gorjanc
   –  Janez Jenko
   –  Mara Battagin
   –  Stefan Edwards
   –  Serap Gonen
   –  Chris Gaynor
   –  Anne-Michelle Faux
   –  Roberto Antolin
   –  John Woolliams
   –  Bruce Whitelaw

•  NIAB
   –  Ian Mackay
   –  Alison Bentley

•  Genus
   –  Alan Mileham
   –  Matthew Cleveland
   –  William Herring

•  Aviagen
   –  Andreas Kranis
   –  Kellie Watson

•  Further information
   –  www.alphagenes.roslin.ed.ac.uk
   –  @GregorGorjanc @HickeyJohn

•  Vacancies
   –  Two post-doc positions currently available

Page 47:

Funding
