
Post on 13-Jan-2016


The Complexities of Data Analysis in Human Genetics

Marylyn DeRiggi Ritchie, Ph.D.
Center for Human Genetics Research
Vanderbilt University
Nashville, TN

Biology is complex

BioCarta

Single nucleotide polymorphisms (SNPs)

Mendelian Traits

[Figure: pedigree labelled with genotypes at two loci (alleles A/a and B/b)]

[Figure: grid of two-locus genotypes; axes: Locus 1, Locus 2; three genotype cells are marked "affected"]

AABB  AABb  AAbb
AaBB  AaBb  Aabb
aaBB  aaBb  aabb

Complex Traits

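[Figure: pedigree labelled with genotypes at two loci (alleles A/a and B/b)]

[Figure: grid of two-locus genotypes; axes: Locus 1, Locus 2; two genotype cells are marked "affected"]

AABB  AABb  AAbb
AaBB  AaBb  Aabb
aaBB  aaBb  aabb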

Complex Traits

• Complex trait implies the involvement of multiple genes and/or environmental factors

• Mendelian trait implies a single mutation

• Mendelian traits are generally rare

• Complex traits are common and of substantial public health impact

Genetic Analysis

• Two main areas of genetic analysis
  1. Linkage analysis
  2. Association analysis

• Methods have been developed for each approach for a variety of different study designs

Association Analysis

• In disease studies, when the disease gene is unknown, we look for association between genetic markers and the disease

• If a marker occurs more frequently or less frequently in affected individuals than in unaffected individuals, then it is associated with the disease.

Association Analysis

• Case-control studies
  – Test for association between marker alleles and the disease phenotype in a group of affected and unaffected individuals sampled randomly from the population

• Family-based studies
  – Test for association between marker alleles and the disease phenotype in a group of affected individuals and their unaffected family members

Case-control data structure

Status  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  SNP10
1       1     2     2     1     2     1     2     2     1     2
1       0     0     0     1     0     0     0     0     1     0
1       0     2     0     1     1     0     2     0     1     1
1       2     0     1     1     0     2     0     1     1     0
1       2     1     1     0     0     2     1     1     0     0
1       1     0     0     0     0     1     0     0     0     0
1       1     1     0     1     2     1     1     0     1     2
1       1     0     1     0     2     1     0     1     0     2
1       0     0     0     2     0     0     0     0     2     0
1       0     0     1     0     1     0     0     1     0     1
0       2     1     0     1     0     2     1     0     1     0
0       0     1     1     0     0     0     1     1     0     0
0       1     1     0     2     1     1     1     0     2     1
0       0     0     2     0     1     0     0     2     0     1
0       2     1     0     1     1     2     1     0     1     1
0       0     0     2     0     0     0     0     2     0     0
0       1     0     0     1     2     1     0     0     1     2
0       0     1     1     1     2     0     1     1     1     2
0       1     1     0     0     2     1     1     0     0     2
0       0     1     2     0     0     0     1     2     0     0
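A single-marker test on data of this shape could be sketched as follows. This is only an illustration under assumptions not stated on the slide: Status 1 is taken to mean affected and 0 unaffected, the genotype codes 0/1/2 are treated as three unordered classes, and a Pearson chi-square test on the 2 x 3 genotype table is used as the test. The helper names `genotype_table` and `chi_square` are made up for this example. With only 20 individuals the expected counts are small, so the p-values are rough at best.

```python
from math import exp

# Status followed by SNP1..SNP10, copied from the case-control table above.
DATA = """\
1 1 2 2 1 2 1 2 2 1 2
1 0 0 0 1 0 0 0 0 1 0
1 0 2 0 1 1 0 2 0 1 1
1 2 0 1 1 0 2 0 1 1 0
1 2 1 1 0 0 2 1 1 0 0
1 1 0 0 0 0 1 0 0 0 0
1 1 1 0 1 2 1 1 0 1 2
1 1 0 1 0 2 1 0 1 0 2
1 0 0 0 2 0 0 0 0 2 0
1 0 0 1 0 1 0 0 1 0 1
0 2 1 0 1 0 2 1 0 1 0
0 0 1 1 0 0 0 1 1 0 0
0 1 1 0 2 1 1 1 0 2 1
0 0 0 2 0 1 0 0 2 0 1
0 2 1 0 1 1 2 1 0 1 1
0 0 0 2 0 0 0 0 2 0 0
0 1 0 0 1 2 1 0 0 1 2
0 0 1 1 1 2 0 1 1 1 2
0 1 1 0 0 2 1 1 0 0 2
0 0 1 2 0 0 0 1 2 0 0
"""

rows = [[int(x) for x in line.split()] for line in DATA.splitlines()]


def genotype_table(rows, snp):
    """Return counts as table[status][genotype] for SNP number `snp` (1-based)."""
    table = [[0, 0, 0], [0, 0, 0]]
    for row in rows:
        table[row[0]][row[snp]] += 1
    return table


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom, ignoring empty columns."""
    n = sum(map(sum, table))
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    stat = 0.0
    for i, r in enumerate(table):
        for j, observed in enumerate(r):
            if col_totals[j] == 0:
                continue
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    used_cols = sum(1 for c in col_totals if c > 0)
    return stat, (len(table) - 1) * (used_cols - 1)


for snp in range(1, 11):
    table = genotype_table(rows, snp)
    stat, df = chi_square(table)
    # For 2 degrees of freedom the chi-square tail probability is exp(-stat / 2).
    p = exp(-stat / 2) if df == 2 else None
    print(f"SNP{snp}: controls={table[0]} cases={table[1]} chi2={stat:.3f} df={df} p={p}")
```

Each SNP is tested on its own here, which is exactly the limitation the later slides on epistasis address.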

Association Analysis

• Single marker tests

• Haplotype association

• Epistasis

Single marker tests

[Figure: SNP1, SNP2 and SNP3 each tested separately for association with Disease; the three SNPs are also labelled as a haplotype]

Haplotype Analysis

• May be able to increase power by testing for association with marker haplotype

• Haplotype is a block of DNA that stays intact through generations

• Do not directly observe marker haplotypes

• Use likelihood methods to infer them
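Why haplotypes are not directly observed can be illustrated with a short sketch. Assuming unphased genotypes written as two-letter strings such as "Aa" (a notation chosen here for the example, as is the function name `compatible_haplotype_pairs`), the code lists every pair of haplotypes that could have produced a given multi-SNP genotype. It only enumerates the possibilities; it does not implement a likelihood method for choosing among them.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a sequence of per-SNP genotypes such as ["Aa", "Bb"],
    each a two-character string naming the two alleles carried.
    """
    pairs = set()
    # At each SNP, either allele may sit on the first chromosome.
    for choice in product(*[(g, g[::-1]) for g in genotype]):
        hap1 = "".join(c[0] for c in choice)
        hap2 = "".join(c[1] for c in choice)
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# A double heterozygote is ambiguous: AB/ab or Ab/aB.
print(compatible_haplotype_pairs(["Aa", "Bb"]))
# A genotype with at most one heterozygous SNP has a single resolution.
print(compatible_haplotype_pairs(["AA", "Bb"]))
```

The number of compatible pairs doubles with each additional heterozygous SNP, which is why statistical inference is needed rather than simple enumeration.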


Epistasis: Gene-Gene Interactions

W. Bateson, Mendel’s Principles of Heredity (1909)

A.R. Templeton, In: Wade et al. (eds), Epistasis and the Evolutionary Process (2000)

• The term epistasis was first used by William Bateson (1909)
• Literal translation is “standing upon” (i.e., one gene masks the effects of another gene)

Genotype at Locus A (rows) by genotype at Locus B (columns)

      BB     Bb    bb
AA    White  Grey  Grey
Aa    Black  Grey  Grey
aa    Black  Grey  Grey

Cordell, Human Molecular Genetics 11:2463-8 (2002)
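The masking idea in this table can be made concrete with a small lookup. The sketch below assumes the three rows correspond to genotypes AA, Aa and aa at Locus A, and it uses a hypothetical helper, `masks_locus_a`, that reports whether a given Locus B genotype hides all variation at Locus A.

```python
# Phenotype for each (Locus A genotype, Locus B genotype) pair, from the table above.
# Assumption: the third row of the table is genotype aa.
PHENOTYPE = {
    ("AA", "BB"): "White", ("AA", "Bb"): "Grey", ("AA", "bb"): "Grey",
    ("Aa", "BB"): "Black", ("Aa", "Bb"): "Grey", ("Aa", "bb"): "Grey",
    ("aa", "BB"): "Black", ("aa", "Bb"): "Grey", ("aa", "bb"): "Grey",
}


def masks_locus_a(locus_b_genotype):
    """True if the phenotype is identical for every Locus A genotype."""
    seen = {PHENOTYPE[(a, locus_b_genotype)] for a in ("AA", "Aa", "aa")}
    return len(seen) == 1


for b in ("BB", "Bb", "bb"):
    print(b, "masks Locus A" if masks_locus_a(b) else "lets Locus A show")
```

Under these assumptions, the effect of Locus A is visible only in BB individuals; carrying a b allele gives Grey regardless of the Locus A genotype.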

Gene-gene Interactions

• Searching for gene-gene interactions brings about a whole new suite of problems and challenges

• Types of interactions
  – Additive
  – Multiplicative
  – Epistatic

• Curse of dimensionality – big problem

Curse of Dimensionality

[Figure: N = 100 (50 cases, 50 controls) across the genotypes of SNP 1 (AA, Aa, aa)]

[Figure: N = 100 (50 cases, 50 controls) across the genotype grid of SNP 1 (AA, Aa, aa) by SNP 2 (BB, Bb, bb)]

[Figure: N = 100 (50 cases, 50 controls) across the genotype grid of SNP 1 (AA, Aa, aa), SNP 2 (BB, Bb, bb), SNP 3 (CC, Cc, cc) and SNP 4 (DD, Dd, dd)]
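The arithmetic behind these figures can be sketched in a few lines: with three genotypes per SNP, the number of genotype combinations grows as 3^k, so a fixed sample of N = 100 is spread ever more thinly. The function names are illustrative, and the calculation assumes individuals are spread evenly across cells, which real data will not do.

```python
def genotype_cells(n_snps, genotypes_per_snp=3):
    """Number of multi-locus genotype combinations for `n_snps` SNPs."""
    return genotypes_per_snp ** n_snps


def mean_per_cell(n_samples, n_snps):
    """Average number of individuals per genotype combination."""
    return n_samples / genotype_cells(n_snps)


for k in range(1, 5):
    print(f"{k} SNP(s): {genotype_cells(k)} cells, "
          f"{mean_per_cell(100, k):.2f} individuals per cell on average")
```

At four SNPs there are 81 cells for 100 individuals, so many cells will be empty or hold a single case or control.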

Three Other Issues to Consider

1. Variable selection

2. Model selection

3. Interpretation

1. Variable Selection

• How can you determine which variables to select?

• Not computationally feasible to evaluate all possible combinations

• Need to select correct variables to detect interactions

How many combinations are there?

• ~500,000 SNPs span 80% of common variation in genome (HapMap)

SNPs in each subset    Number of possible combinations
1                      5 x 10^5
2                      1 x 10^11
3                      2 x 10^16
4                      3 x 10^21
5                      2 x 10^26

2 x 10^26 combinations
* 1 combination per second
* 86400 seconds per day
---------
2.979536 x 10^21 days to complete
(8.163113 x 10^18 years)
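The order of magnitude of these counts can be checked with binomial coefficients. The sketch below assumes exactly 500,000 SNPs, one combination evaluated per second, and a 365-day year. Because the slide's figures are rounded and its exact SNP count is not given, the printed values should agree in magnitude rather than digit for digit.

```python
from math import comb

N_SNPS = 500_000          # assumed SNP count, following the ~500,000 on the slide
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365       # assumption: no leap years


def n_subsets(k, n=N_SNPS):
    """Number of distinct subsets of k SNPs out of n."""
    return comb(n, k)


for k in range(1, 6):
    print(f"{k} SNPs per subset: {n_subsets(k):.2e} combinations")

days = n_subsets(5) / SECONDS_PER_DAY    # at one combination per second
years = days / DAYS_PER_YEAR
print(f"5-SNP search: {days:.2e} days, {years:.2e} years")
```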

2. Model Selection

• For each variable subset, evaluate a statistical model

• Goal is to identify the best subset of variables that compose the best model

Finding the best model

Choose variable subset

Choose statistical model

Evaluate model fitness

Best model
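The loop on this slide can be sketched as an exhaustive search. Everything specific here is an assumption made for illustration: the "statistical model" is a simple majority vote within each genotype combination, the "fitness" is classification accuracy on the same data (no cross-validation or permutation testing), and the toy data set is constructed so that status depends only on the combination of two SNPs. It is not an implementation of any published method.

```python
from collections import Counter
from itertools import combinations, product


def fit_majority_model(rows, subset):
    """Label each genotype combination 1 (case-like) or 0 by majority vote."""
    balance = Counter()
    for status, genotypes in rows:
        key = tuple(genotypes[i] for i in subset)
        balance[key] += 1 if status == 1 else -1
    return {key: int(b > 0) for key, b in balance.items()}


def accuracy(rows, subset, model):
    """Fraction of individuals whose status the model reproduces."""
    correct = sum(
        model.get(tuple(genotypes[i] for i in subset), 0) == status
        for status, genotypes in rows
    )
    return correct / len(rows)


def best_model(rows, n_snps, subset_size):
    """Return (fitness, subset) for the best-scoring SNP subset of a given size."""
    best = None
    for subset in combinations(range(n_snps), subset_size):  # choose variable subset
        model = fit_majority_model(rows, subset)              # choose statistical model
        fitness = accuracy(rows, subset, model)               # evaluate model fitness
        if best is None or fitness > best[0]:
            best = (fitness, subset)
    return best


# Toy data: every genotype combination of 4 SNPs once; status depends only on
# SNP 0 and SNP 2 jointly (a purely interactive, made-up rule).
toy = [((a + c) % 2, (a, b, c, d)) for a, b, c, d in product(range(3), repeat=4)]

print(best_model(toy, n_snps=4, subset_size=1))
print(best_model(toy, n_snps=4, subset_size=2))
```

In this toy setting no single SNP classifies everyone correctly, while the pair (0, 2) does, which is the situation that motivates searching over variable subsets in the first place.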

Simple Fitness Landscape

[Figure: fitness plotted against model; axes: Model, Fitness]

Complex Fitness Landscape

[Figure: fitness plotted against model; axes: Model, Fitness]

3. Interpretation

• Selection of best statistical model in a vast search space of possible models

• Statistical or computational model may not translate into biology

• May not be able to identify prevention or treatment strategies directly

• Wet lab experiments will be necessary, but may not be sufficient

3. Interpretation

• Strategies to assess biological interpretation of gene-gene interaction models

1. Consider current knowledge about the biochemistry of the system and the biological plausibility of the models

2. Perform experiments in the wet lab to measure the effect of small perturbations to the system

3. Use computer simulation algorithms to model biochemical systems

Additional Challenges (true of all association studies)

• Sample size and power/type I error

• Population specific effects
  – Age, gender

• Poorly matched cases and controls
  – Ethnic background
  – Controls must be “at risk”

• Bias

• Heterogeneity

Heterogeneity

• Phenotypic (Clinical, Trait)
  – Affected individuals vary in clinical expression

• Genetic
  – Different inheritance patterns for same disease

• Locus
  – Different genes lead to the same disease

• Allelic
  – Different alleles at the same gene lead to same/different disease

Thornton-Wells TA, Moore JH, Haines JL. Trends in Genetics, 2004;20(12):640-7.

New Statistical Approaches

• Data Reduction
  – Combinatorial Partitioning Method (CPM)
  – Multifactor Dimensionality Reduction (MDR)
  – Detection of Informative Combined Effects (DICE)
  – Logic Regression
  – Set Association Analysis

• Pattern Recognition
  – Symbolic Discriminant Analysis (SDA)
  – Cellular Automata (CA)
  – Neural Networks (NN)

Areas of Future Work (possible collaborations)

• More analytical methods for gene-gene and gene-environment interactions
  – Especially including categorical and continuous variables simultaneously

• Inclusion of pathway information into analyses

• Ways of dealing with heterogeneity of all kinds
