The Causes of Variation Lindon Eaves and Tim York Boulder, CO March 2001

Page 1: The Causes of Variation

The Causes of Variation

Lindon Eaves and Tim York

Boulder, CO

March 2001

Page 2: The Causes of Variation

One Issue (Among Many!)

• Identifying genes that cause complex diseases and genes that contribute to variation in quantitative traits

Page 3: The Causes of Variation

Quantitative Trait Locus (QTL)

Any gene whose contribution to variation in a quantitative trait is large enough to stand out against the background noise of other genetic and environmental factors

Page 4: The Causes of Variation

Quantitative Trait

A continuously variable trait (in which variation may be caused by multiple genetic and/or environmental factors); any categorical trait in which differences between categories may be mapped onto variation in a continuous trait

Page 5: The Causes of Variation

Common diseases

• Estimated lifetime risk c. 60%

• Substantial genetic component

• “Non-Mendelian” inheritance

• Non-genetic risk factors

• Multiple interacting pathways

• Most genes still not mapped

Page 6: The Causes of Variation

Examples

• Ischaemic heart disease (30-50%, F-M)

• Breast cancer (12%, F)

• Colorectal cancer (5%)

• Recurrent major depression (10%)

• ADHD (5%)

• Non-insulin dependent diabetes (5%)

• Essential hypertension (10-25%)

Page 7: The Causes of Variation

Even for “simple” diseases: Number of alleles is large

(Wright et al., 1999)

• Ischaemic heart disease (LDLR) >190

• Breast cancer (BRCA1) >300

• Colorectal cancer (MLH1) >140

Page 8: The Causes of Variation

Definitions

• Locus: One of c. 30-40,000 genes

• Allele: One of several variants of a specific gene

• Gene: A sequence of DNA that codes for a specific function

• Base pair: Chemical “letter” of the genome (a gene has many thousands of base pairs)

• Genome: All the genes considered together

Page 9: The Causes of Variation

Finding QTLs

• Linkage

• Association

Page 10: The Causes of Variation

Linkage

Finds QTLs by correlating phenotypic similarity with genetic similarity (“IBD”) in specific parts of the genome

Page 11: The Causes of Variation

Linkage

• Doesn’t depend on “guessing gene”

• Works over broad regions (good for getting in right ball-park) and whole genome (“genome scan”)

• Only detects large effects (>10%)

• Requires large samples (10,000’s?)

• Can’t guarantee close to gene

Page 12: The Causes of Variation

Association

• Looks for correlation between specific alleles and phenotype (trait value, disease risk)

Page 13: The Causes of Variation

Association

• More sensitive to small effects

• Need to “guess” gene/alleles (“candidate gene”) or be close enough for linkage disequilibrium with nearby loci

• May get spurious association (“stratification”) – need to have genetic controls to be convinced

Page 14: The Causes of Variation

“Reality”:

For complex disorders and quantitative traits

Large number of alleles at large number of genes

Page 15: The Causes of Variation

Defining the Haystack

• 3 x 10^9 base pairs

• Markers every 6-10 kb for association in populations with no recent bottleneck history

• 1 SNP per 721 bp (Wang et al., 1998)

• c. 14 SNPs per 10 kb = 1000s of haplotypes/alleles

• O(10^4-10^5) genes
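
The haystack figures can be checked with a little arithmetic. The worked values below are a rough sketch: they assume SNPs are evenly spaced and diallelic, which the slide does not state, and the haplotype count is only an upper bound under that assumption.

```latex
\[
\frac{3\times10^{9}\ \text{bp}}{721\ \text{bp per SNP}} \approx 4.2\times10^{6}\ \text{SNPs in the genome},
\qquad
\frac{10^{4}\ \text{bp}}{721\ \text{bp per SNP}} \approx 13.9 \approx 14\ \text{SNPs per 10 kb}
\]
\[
2^{14} = 16{,}384\ \text{possible haplotypes per 10 kb window (upper bound for diallelic SNPs)},
\qquad
\frac{3\times10^{9}}{10^{4}}\ \text{to}\ \frac{3\times10^{9}}{6\times10^{3}} = 3\times10^{5}\ \text{to}\ 5\times10^{5}\ \text{markers}
\]
```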

Page 16: The Causes of Variation

Problems

• Large number of loci and alleles/haplotypes

• Possible interactions between genes

• Possible interactions between genes and environment

• Relatively low frequencies of individual risk factors

• Functional form of genotype-phenotype relations not known

• Sorting out signal from noise – minimizing errors within budget

• Scaling of phenotype (continuous, discontinuous)

• Spurious association (stratification)

Page 17: The Causes of Variation

Prepare for the worst

Need statistical approaches that can screen enormous numbers of loci and alleles to reliably identify those that have an impact on disease risk

Page 18: The Causes of Variation

System Chosen for Study

• 100 loci

• 20 loci affect outcome, 80 “nuisance” genes

• 257 alleles/locus

• Allele frequencies c. 20-0.1%

• Disease genes each explain 2.5% variance in risk (c. 2-fold risk increase)

• 40% rarest alleles increase risk

• 50% variance non-genetic
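
The variance figures for this system add up as follows. This is a sketch that assumes the 20 disease loci contribute independently and additively; the symbols are introduced here for illustration and do not appear in the slides.

```latex
\[
V_{\text{genetic}} = 20 \times 2.5\% = 50\%,
\qquad
V_{\text{genetic}} + V_{\text{non-genetic}} = 50\% + 50\% = 100\%
\]
```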

Page 19: The Causes of Variation
Page 20: The Causes of Variation

It’s a Mess!

• Don’t know which genes – might have clues

• Don’t know which alleles – unordered categories

• >250^100 locus/allele combinations

• More predictor combinations than people (“curse of dimensionality”)

• Reality worse
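
To get a feel for the size of that number, the short sketch below counts the ways of picking one allele at each locus in the system chosen for study (257 alleles at each of 100 loci). The variable names are ours, and counting one allele per locus is a simplifying assumption.

```python
import math

N_LOCI = 100              # loci in the simulated system
ALLELES_PER_LOCUS = 257   # alleles at each locus

# Number of distinct ways to pick one allele at every locus.
combinations = ALLELES_PER_LOCUS ** N_LOCI
digits = len(str(combinations))

# Order of magnitude, for comparison with a sample of 10,000 people.
log10_combinations = N_LOCI * math.log10(ALLELES_PER_LOCUS)
print(f"257^100 has {digits} digits (about 10^{log10_combinations:.0f})")
```

Any realistic sample is many orders of magnitude smaller than this, which is why exhaustive enumeration is not an option.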

Page 21: The Causes of Variation

Problems

• Informatics: large volume of data

• Computational: large number of combinations

• Statistical: large number of chance associations

• Genetic-epidemiological: secondary associations

Page 22: The Causes of Variation
Page 23: The Causes of Variation
Page 24: The Causes of Variation
Page 25: The Causes of Variation
Page 26: The Causes of Variation

How are we going to figure it out?

Page 27: The Causes of Variation

Data Mining (Steinberg and Cartel)

• Attempt to discover possibly very complex structure in huge databases (large number of records and large number of variables)

• Problems include classification, regression, clustering, association (market analysis)

• Need tools to partially or fully automate the discovery process

• Large databases support search for rare but important patterns and interactions (epistasis, GxE)

Page 28: The Causes of Variation

Some Approaches to DM

• Logistic regression

• Neural networks

• “CART” (Breiman et al. 1984)

• “MARS” (Friedman, 1991)

Page 29: The Causes of Variation

“MARS”

• Multivariate

• Adaptive

• Regression

• Splines

Page 30: The Causes of Variation

Key references

Friedman, J.H. (1991) Multivariate Adaptive Regression Splines (with discussion), Annals of Statistics, 19: 1-141.

Steinberg, D., Bernstein, B., Colla, P., Martin, K., Friedman, J.H. (1999) MARS User Guide. San Diego, CA: Salford Systems

Page 31: The Causes of Variation

The MARS Advantage

• Allows large number of predictors (loci/alleles/environments) to be screened

• Non-parametric

• Continuous and discontinuous outcomes

• Systematic search for detailed interactions

• Testing and cross-validation

• Continuous and categorical predictors

• Decides best form of relationship

Page 32: The Causes of Variation

Example Regression Spline:

Impact of Non-Retail Business on Median Boston House Prices

[Figure: median house price (y axis, 0 to 20) plotted against INDUS, the proportion of non-retail industrial business (x axis, 0 to 28), with a single “knot”. Curve 1: Maximum = 19.08890]

Model for spline:

b1 = max(0, INDUS - 8.140)

b2 = max(0, 8.140 - INDUS)

Y = 20.968 - 0.268 b1 + 1.802 b2
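
The spline model above can be evaluated directly. The sketch below is only a transcription of the slide's formula into Python; the function names are ours.

```python
def hinge(value):
    """The MARS building block max(0, value)."""
    return max(0.0, value)

def predicted_price(indus, knot=8.140):
    """Y = 20.968 - 0.268 b1 + 1.802 b2, with the knot at INDUS = 8.140."""
    b1 = hinge(indus - knot)   # active to the right of the knot
    b2 = hinge(knot - indus)   # active to the left of the knot
    return 20.968 - 0.268 * b1 + 1.802 * b2

for indus in (0.0, 8.140, 18.140):
    print(indus, round(predicted_price(indus), 3))
```

To the left of the knot the predicted price falls steeply as INDUS rises (slope -1.802); to the right it falls gently (slope -0.268).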

Page 33: The Causes of Variation

Fitting functions with Splines

• Piece-wise linear regression.

– Simplest form. Allows the regression to bend.

• “Knots” define where the function changes behavior.

• Local fit vs. global fit.

[Figure: actual data and a fitted spline with 3 knots]

Page 34: The Causes of Variation

One predictor example

True knots at 20 and 45 (left)

Best single knot at about 35 (right)

[Figure: two panels plotting Y against X, with X running from 10 to 60]

Page 35: The Causes of Variation

[Figure: four panels, each with an X axis running from 10 to 60]

Page 36: The Causes of Variation

Re-express variables as basis functions

• Done to generalize the search for knots. Difficult to illustrate splines with > one dimension.

• Core building block of MARS model

– max(0, X – c)

– Example: BF1 = max(0, ENV – 5); BF2 = max(0, ENV – 8)

– Slope of the fitted function: 0 for ENV <= 5; β1 for 5 < ENV <= 8; β1 + β2 for ENV > 8

• Weighted sum of basis functions used to approximate the global function.

– i.e. y = constant + β1 * BF1 + β2 * BF2 + error
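
A minimal sketch of how two basis functions with knots at ENV = 5 and ENV = 8 combine into a piecewise-linear function. The constant and the two weights (`beta1`, `beta2`) are hypothetical values chosen for illustration, not estimates from the slides.

```python
def bf(x, knot):
    """Basis function max(0, x - knot)."""
    return max(0.0, x - knot)

def model(env, constant=1.0, beta1=0.5, beta2=2.0):
    """Weighted sum of the basis functions BF1 (knot 5) and BF2 (knot 8)."""
    return constant + beta1 * bf(env, 5) + beta2 * bf(env, 8)

def slope(env_lo, env_hi):
    """Average slope of the model between two ENV values."""
    return (model(env_hi) - model(env_lo)) / (env_hi - env_lo)

print(slope(0, 5))    # flat below the first knot
print(slope(5, 8))    # slope beta1 between the knots
print(slope(8, 10))   # slope beta1 + beta2 above the second knot
```

Each knot switches on one more slope term, so the function stays continuous while its gradient changes at ENV = 5 and ENV = 8.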

Page 37: The Causes of Variation

“Adaptive” Spline

• “Optimal” placement of knots

• “Optimal” selection of predictors and interactions

Page 38: The Causes of Variation

Adaptive splines

• Problem:

– What is the optimal location of knots?

– How many knots do you need?

– Best to test all variable / knot locations, but computationally burdensome.

• MARS solution:

– Develop an overfit model with too many knots.

– Remove all knots that contribute little to model quality.

– The final model should have approximately correct knot locations.

Page 39: The Causes of Variation

“Optimal”

Explains “salient” features of data

Ignores irrelevant features

Stands up to replication

- Several ways to operationalize mathematically

Page 40: The Causes of Variation

MARS 2-step model building

• Step 1. Growing phase:

– Begins with only a constant in the model.

– Serially adds basis functions up to a user-defined limit, testing each for improvement when added to the model.

– Basis functions are added until an overly large model is found (theoretically, the true model is captured within it).

• Step 2. Pruning phase:

– Delete the basis function that contributes least to model fit.

– Refit the model and delete the next term; repeat.

– The most parsimonious model is selected.

• GCV criterion to select optimal model (Craven 1979).

• MARS option uses 10-fold cross-validation to estimate DF.
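
The two phases can be illustrated with a much-simplified sketch. It is not the MARS program: it handles a single predictor, considers no interactions, searches a fixed grid of candidate knots, and uses a simplified GCV-style score with a fixed, assumed charge of 3 DF per term rather than a cross-validated one. All function names, the simulated data and the slopes are hypothetical.

```python
import numpy as np

def column(x, term):
    """Basis function max(0, sign * (x - knot)) for term = (knot, sign)."""
    knot, sign = term
    return np.maximum(0.0, sign * (x - knot))

def rss(x, y, terms):
    """Residual sum of squares of a least-squares fit on a constant plus terms."""
    X = np.column_stack([np.ones_like(x)] + [column(x, t) for t in terms])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def gcv(x, y, terms, penalty=3.0):
    """Simplified GCV: mean squared error inflated by a per-term DF charge."""
    n = len(y)
    charged_df = 1 + len(terms) * (1 + penalty)   # assumed form of the charge
    return (rss(x, y, terms) / n) / (1.0 - charged_df / n) ** 2

def grow(x, y, candidates, max_terms=8):
    """Step 1: start from a constant and greedily add mirrored hinge pairs."""
    terms = []
    while len(terms) < max_terms:
        unused = [c for c in candidates if (c, 1) not in terms]
        if not unused:
            break
        best = min(unused, key=lambda c: rss(x, y, terms + [(c, 1), (c, -1)]))
        terms += [(best, 1), (best, -1)]
    return terms

def prune(x, y, terms):
    """Step 2: delete the least useful term, refit, repeat; keep the best score."""
    current, best_terms = list(terms), list(terms)
    best_score = gcv(x, y, terms)
    while current:
        i = min(range(len(current)),
                key=lambda j: rss(x, y, current[:j] + current[j + 1:]))
        current = current[:i] + current[i + 1:]
        score = gcv(x, y, current)
        if score <= best_score:
            best_terms, best_score = list(current), score
    return best_terms

# Simulated one-predictor data with true knots at 20 and 45 (hypothetical slopes).
rng = np.random.default_rng(0)
x = rng.uniform(10, 60, 300)
y = 5 + np.maximum(0, x - 20) - 1.5 * np.maximum(0, x - 45) + rng.normal(0, 1, x.size)

candidates = [float(c) for c in np.quantile(x, np.linspace(0.05, 0.95, 19))]
grown = grow(x, y, candidates)
pruned = prune(x, y, grown)
print("knots kept:", sorted({round(k, 1) for k, _ in pruned}))
```

The growing phase deliberately overshoots (four knots here); pruning then discards terms whose removal barely changes the fit, because the complexity charge outweighs their contribution.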

Page 41: The Causes of Variation

Cross-validation

• Protects against overfitting the data.

• Develops a model on a subset of the data. Tests fit on the remaining set.

• Systematically assesses how many DF to charge each variable entered into the model.

– Adding a basis function will always lower MSE.

– This reduction is penalized by the DF charged.

• Only backwards deletion step is penalized.
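
A small sketch of why held-out data are needed. Here the true relationship is a straight line, so every knot added is fitting noise: the training MSE can only go down as knots are added, whereas the 10-fold held-out MSE has no such guarantee. The data, knot positions and function names are hypothetical, and this illustrates ordinary k-fold cross-validation rather than the exact procedure used by the MARS software.

```python
import numpy as np

def design(x, knots):
    """Constant, linear term, and one hinge max(0, x - k) per knot."""
    cols = [np.ones_like(x), x] + [np.maximum(0.0, x - k) for k in knots]
    return np.column_stack(cols)

def mse(x_fit, y_fit, x_eval, y_eval, knots):
    """Fit by least squares on one data set, report MSE on another."""
    coef, *_ = np.linalg.lstsq(design(x_fit, knots), y_fit, rcond=None)
    resid = y_eval - design(x_eval, knots) @ coef
    return float(np.mean(resid ** 2))

def cv_mse(x, y, knots, n_folds=10, seed=0):
    """Average held-out MSE over n_folds random folds."""
    order = np.random.default_rng(seed).permutation(len(x))
    errors = []
    for held_out in np.array_split(order, n_folds):
        train = np.setdiff1d(order, held_out)
        errors.append(mse(x[train], y[train], x[held_out], y[held_out], knots))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
x = rng.uniform(10, 60, 200)
y = 0.5 * x + rng.normal(0, 2, x.size)   # truly linear: every knot is noise

nested = [[], [20], [20, 30], [20, 30, 40], [20, 30, 40, 50]]
train_mse = [mse(x, y, x, y, k) for k in nested]
held_out_mse = [cv_mse(x, y, k) for k in nested]

for k, t, h in zip(nested, train_mse, held_out_mse):
    print(f"{len(k)} knots: training MSE {t:.3f}, 10-fold MSE {h:.3f}")
```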

Page 42: The Causes of Variation

Genetic Example: Regression spline for multi-allelic locus

Probability of disease = 0.037 + 0.114 b1. Where:

b1 = 1 if ( LOCUS1 = 30 OR LOCUS1 = 37 OR LOCUS1 = 39 OR LOCUS1 = 43 OR LOCUS1 = 44 OR LOCUS1 = 46 OR LOCUS1 = 66 OR LOCUS1 = 73 OR LOCUS1 = 76 OR LOCUS1 = 78 OR LOCUS1 = 79 OR LOCUS1 = 80 OR LOCUS1 = 83 OR LOCUS1 = 87 OR LOCUS1 = 90 OR LOCUS1 = 95 OR LOCUS1 = 103 OR LOCUS1 = 106 OR LOCUS1 = 111 OR LOCUS1 = 113 OR LOCUS1 = 114 OR LOCUS1 = 116 OR LOCUS1 = 118 OR LOCUS1 = 128 OR LOCUS1 = 129 OR LOCUS1 = 133 OR LOCUS1 = 134 OR LOCUS1 = 139 OR LOCUS1 = 146 OR LOCUS1 = 147 OR LOCUS1 = 148 OR LOCUS1 = 170 OR LOCUS1 = 177 OR LOCUS1 = 179 OR LOCUS1 = 182 OR LOCUS1 = 183 OR LOCUS1 = 185 OR LOCUS1 = 192 OR LOCUS1 = 202 OR LOCUS1 = 208 OR LOCUS1 = 209 OR LOCUS1 = 214 OR LOCUS1 = 215 OR LOCUS1 = 218 OR LOCUS1 = 219 OR LOCUS1 = 222 OR LOCUS1 = 223 OR LOCUS1 = 226 OR LOCUS1 = 229 OR LOCUS1 = 230 OR LOCUS1 = 231 OR LOCUS1 = 232 OR LOCUS1 = 235 OR LOCUS1 = 236 OR LOCUS1 = 237 OR LOCUS1 = 240 OR LOCUS1 = 241 OR LOCUS1 = 242 OR LOCUS1 = 244 OR LOCUS1 = 253 OR LOCUS1 = 254),

b1 = 0 otherwise
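
For an unordered categorical predictor such as a multi-allelic locus, the basis function is an indicator of membership in a set of alleles. The sketch below transcribes the model above; the names `HIGH_RISK_ALLELES` and `disease_probability` are ours.

```python
# Alleles of LOCUS1 for which b1 = 1, copied from the fitted model above.
HIGH_RISK_ALLELES = {
    30, 37, 39, 43, 44, 46, 66, 73, 76, 78, 79, 80, 83, 87, 90, 95,
    103, 106, 111, 113, 114, 116, 118, 128, 129, 133, 134, 139,
    146, 147, 148, 170, 177, 179, 182, 183, 185, 192, 202, 208, 209,
    214, 215, 218, 219, 222, 223, 226, 229, 230, 231, 232, 235, 236,
    237, 240, 241, 242, 244, 253, 254,
}

def disease_probability(locus1_allele):
    """Probability of disease = 0.037 + 0.114 * b1."""
    b1 = 1 if locus1_allele in HIGH_RISK_ALLELES else 0
    return 0.037 + 0.114 * b1

print(disease_probability(30))   # allele in the high-risk set
print(disease_probability(1))    # allele outside the set
```

The model therefore splits the 257 alleles into two groups: a baseline risk of 0.037, and a raised risk of 0.151 for the 61 alleles in the set.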

Page 43: The Causes of Variation

What happens when nothing is going on? Including only “nuisance” loci (21-80). N=10,000.

Validation                 Loci identified
None                       23 25 30 32 35-37 40 47 50 54 55 57 64 68 72 74 76 87 89 91 92 94 96 97
10-fold cross-validation   25

Page 44: The Causes of Variation

Loci Identified as contributing to variation in outcome

Sample size   Validation     Loci identified
1000          None           2 5-8 10-12 14-18 20 24 40 43 45 56 59 70 77 94
1000          Split-sample   7 10 14
1000          10-fold        14
2000          None           2 3 5 6 8-18 20 38 45 47 69 72 80 88 95 100
2000          Split-sample   12 14 20
2000          10-fold        14
5000          None           2-20 29 32 43 55 56 74
5000          Split-sample   10 15 16 20
5000          10-fold        2-19
10000         None           1-20 25 26 94
10000         Split-sample   1-20 25 94
10000         10-fold        1-20

Page 45: The Causes of Variation

Correct (+) and Incorrect (-) Assignment of Alleles to High- and Low-Risk Groups by MARS Model (N=10,000)

         Low Risk (N=30)   High Risk (N=227)
Locus    +      -          +      -
1        29     1          146    81
2        29     1          145    82
3        29     1          152    75
4        30     0          138    89
5        30     0          142    85
6        28     2          139    88
7        28     2          143    84
8        29     1          148    79
9        27     3          154    73
10       29     1          157    70
11       29     1          155    72
12       29     1          147    80
13       30     0          155    72
14       30     0          149    78
15       29     1          170    57
16       30     0          150    77
17       28     2          151    76
18       28     2          147    80
19       29     1          140    87
20       29     1          146    81

Page 46: The Causes of Variation

So Far:

Does quite well for largish random samples and continuous outcomes.

-What about disease (dichotomous) outcomes?

-What about selected (extreme) samples?

Page 47: The Causes of Variation

Generating Dichotomous Outcomes from Continuous Measure

Threshold Prevalence

21 9.1%

22 4.9%

24 1.0%
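
A minimal sketch of how a threshold on a continuous measure yields a dichotomous (affected / unaffected) outcome with a given prevalence. The slides do not state the distribution of the continuous measure; the normal distribution with mean 17 and SD 3 used here is our assumption, chosen only because it gives prevalences close to those tabulated.

```python
from statistics import NormalDist

# Assumed liability distribution: mean 17 and SD 3 are illustrative guesses,
# not values reported in the slides.
liability = NormalDist(mu=17.0, sigma=3.0)

def prevalence(threshold):
    """Proportion of the population whose liability exceeds the threshold."""
    return 1.0 - liability.cdf(threshold)

def affected(score, threshold):
    """Dichotomous outcome: 1 if the continuous score exceeds the threshold."""
    return 1 if score > threshold else 0

for threshold in (21, 22, 24):
    print(threshold, f"{100 * prevalence(threshold):.1f}%")
```

Raising the threshold makes the disease rarer, which leaves fewer affected individuals in a sample of fixed size.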

Page 48: The Causes of Variation

Loci Identified by fitting MARS model to dichotomous outcomes (N=10,000)

Prevalence   No validation                  10-fold cross validation
9.1%         1 2 5 6 8 9 11-17 19           16
4.9%         1 2 4 5 6 9 10 13-15 17-20     8
1.0%         1 2 5 8 9 10-17 19 56          2

Page 49: The Causes of Variation

Loci cross-validated by MARS model for extremes from sample of 10,000 screened individuals

Proportion selected
Upper %   Lower %   Total N   Outcome       Loci cross-validated
9.2       11.2      2024      Continuous    1-3 5-10 66 88 75
                              Dichotomous   2 3 5-29 69
4.9       6.3       1116      Continuous    1-3 6-10 12-15 18 20 68
                              Dichotomous   1-4 6-8 10-15 17 19 48

Page 50: The Causes of Variation

So?

• Can detect signal due to relatively large numbers of relatively rare unordered alleles of relatively small effect at relatively many loci amid the noise of still more loci and environmental effects

• “MARS” may provide elements for analyzing such data in this and similar contexts (?microarrays, SNPs, expression arrays?)

• Works with continuous data on random samples and dichotomous outcomes on selected samples

Page 51: The Causes of Variation

GAW12 – Simulated data

• Provided for two populations:

– large general pop.

– pop. isolate – founded 20 generations ago by 100 ind.

– limited migration b/w.

• Common disease:

– prevalence of 25%, increases with age

– middle age disease, some early onset

– more common in females than males

Page 52: The Causes of Variation

• General population

– 7 genes simulated

– 13 to 20 kb

– 12 to 40 diallelic sites at start of simulation

– passed through 120 to 200K generations of random mating:

• mutation, intragenic recombination, gene conversion – allowed at diff. rates for diff. genes

• each gene contains a 500 bp recombination hotspot – 15 to 65% of intragenic recombinations

• 8 to 13 mutational hotspots per gene (6 to 300 times increased mutation rate)

– 25% of genes isolated for 35 to 85K generations.

Page 53: The Causes of Variation

                               GENE1         GENE5
Length (kb)                    20            17
Start # of SNP                 40            20
Random mating                  150K          165K
Rec. rate                      .01           .002
Mutation rate                  4 x 10^-8     6 x 10^-9
Gene conv.                     .01           .002
Mean length conv.              1000          1600
Start of rec. hotspot / % in   10349 / 50%   4197 / 65%
# mutat. hotspot               13            8
Incr mut rate                  200           20

Page 54: The Causes of Variation

• Isolate population

– loosely modeled after pop. history of Old Order Amish in Lancaster Co., PA

– Founders: 200 chr.’s sampled from general pop.

– 20,000 chr.’s sampled from general pop. to create an “outside pop”

– Isolate: children <12, mean 4; Outside: children <12, 1

– migration allowed b/w pop.s at each generation

• rate: migrants = 5% of current isolate size

– evolution progressed for 20 generations with recombination (no mutations, no intragenic rec.)

– founders were then sampled to create the isolate pop.

Page 55: The Causes of Variation

• 23 extended pedigrees with 1,497 individuals from each population (1,000 living)

• Pedigrees include the proband, spouse, and all first, second, and third degree relatives of each.

• Living individuals are provided:

– affected status, fid, mid, sex

– age at last exam

– age of onset if affected

– 5 quantitative risk factors

– 2 environmental risk factors (binary and quantitative)

– marker genotype for 1 cM whole genome screen: 2,855 total markers with an average of 9.1 alleles

– sequence data for 7 candidate genes – 1,176 sequence variants

• 50 replicates provided for each pop.

Page 56: The Causes of Variation
Page 57: The Causes of Variation

Sequence data

• Isolate and General population

• Intron and Exon sequence from 7 candidate genes.

• Kept only those individuals with sequence data. Each set contains 7,000 individuals. 64 MB MARS limit.

• 5 sets of 7 randomly selected replicates (used 35 of 50 replicates provided)

• 5 associated quantitative risk factors.

• Covariates included: E1, E2, Age, Sex, Age of onset.

Page 58: The Causes of Variation

• Affected status binary.

• Exon sequence coded for each individual as having 0, 1, or 2 ancestral variants.

• If intron variant present (whether 1 or 2 copies), given a value of 1. Coded in binary form as haplotypes of length four.

Page 59: The Causes of Variation

[Figure: path diagram linking Liability, Aff Status and Age of onset to Q1-Q5, E1, E2, Age, MG1-MG6, CG1, CG2 and CG6]

Page 60: The Causes of Variation

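AFF
True model: E1, Q1-Q5, MG6 [557]
Isolate pop.: E1, Q1-Q5, MG6 [(435 547 548 557) 5244 5268 6912 7281]
General pop.: E1, Q1-Q5, MG6 [(27 57 76 110)(435 547 548 557)]

Q1
True model: E1, MG1 [5782]
Isolate pop.: MG1 [5007]
General pop.: MG1 [5782]

Q2
True model: E1, MG1 [5782]
Isolate pop.: E1, MG1 [5007]
General pop.: E1, MG1 [5782]

Q3
True model: E1, E2
Isolate pop.: E1, E2
General pop.: E1, E2

Q4
True model: E1, AGE
Isolate pop.: E1, AGE
General pop.: E1, AGE

Q5
True model: E1, MG5 [multi-allelic]
Isolate pop.: E1, MG5 [1289 3745 8657 8817]
General pop.: E1, MG5 [1289 3745 8657 8817]

ONSET
True model: MG6 [557]
Isolate pop.: MG6 [15625]
General pop.: none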

Page 61: The Causes of Variation

Conclusions

• MARS works well to capture functional form of disease etiology in simulated data with dichotomous outcome.

• In most cases was within 1 kb of functional variant.

• Generated a predictive model that was replicable in at least 4 of 5 data sets.

• Highly interpretable output in the form of basis functions and Importance values.

• MARS may have problems with highly correlated variables.

• Pattern-recognition tools can be useful to narrow down search for genes.

Page 62: The Causes of Variation

Comparison of MARS and ANN

Both are non-parametric estimation schemes, allow for a high number of input predictors, allow for interactions, and non-linear mappings.

MARS: Maximum allowable basis functions and degree of interactions.
ANN: Type of network architecture needs to be specified.

MARS: Models are developed fast.
ANN: Models are trained more slowly (DeVeaux et al. 1993).

MARS: Backwards elimination stage to remove unnecessary basis functions.
ANN: Problem of overfitting the data, esp. with small data sets.

MARS: Easily interpretable basis functions. Local interpretation of the function.
ANN: Black box: weights have little meaning. Difficult to interpret predictor contribution.

MARS: Penalizes model complexity. Tries to develop a low order, interpretable model.
ANN: Non-linear transformations and high connectivity allow for complexity.

Page 63: The Causes of Variation

But the Haystack is Very Large

• Reality worse than simulations

• More alleles at more loci

• Phenotypes more complex (multivariate)

• More irrelevant loci (?1000’s)

• Interactions with environment and between loci

• Spurious associations

Page 64: The Causes of Variation

It Needs Collaboration

Clinical

Statistical

Molecular

Epidemiological

Physiological

Developmental

Informational

Evolutionary