recomb satellite workshop, 2007 algorithms for association mapping of complex diseases with...

18
RECOMB Satellite Workshop , 2007 Algorithms for Association Mapping of Complex Diseases With Ancestral Recombination Graphs Yufeng Wu UC Davis

Post on 22-Dec-2015

217 views

Category:

Documents


2 download

TRANSCRIPT

RECOMB Satellite Workshop, 2007

Algorithms for Association Mapping of Complex Diseases With

Ancestral Recombination Graphs

Yufeng Wu

UC Davis

2

Association (or LD) Mapping

• Given a subset of SNPs from unrelated individuals, find unobserved genetic variations that strongly discriminate individuals with the trait (cases) and those without the trait (controls)

• Complex Diseases: difficult to map

3

Illustration (Zollner and Pritchard, Genetics, 2005)

Cases

ControlsSNP markers

1: 0011012: 1100003: 0011104: 0010005: 0000106: 1111017: 1000118: 1100019: 11001010: 10001111: 01000012: 101101

4

Some Challenges in Association Mapping

1 2

5

The Genealogy Approach

• “..the best information that we could possibly get about association is to know the full coalescent genealogy…” – Zollner and Pritchard

• Goal: infer genealogy from marker data with recombination– Approximation (e.g. in Zollner and Pritchard)

6

Ancestral Recombination Graph (ARG)

10 01 00

S1 = 00S2 = 01S3 = 10S4 = 10

MutationsS1 = 00S2 = 01S3 = 10S4 = 11

10 01 0011

Recombination

Assumption:

at most one mutation per site

1 0 0 1

1 1

7

Full-ARG Approaches

• First full ARG mapping method (Minichiello and Durbin)– Use full plausible ARG, but heuristic– Less complex disease model

• Our results (Wu, 2007)– Sampling full ARGs with provable property, and work

on more complex disease model– Focus on parsimonious history

• minARGs: ARGs that use the minimum number of recombinations

• Near minimum ARGs

– Uniform sampling of minARGs

8

Special Case: ARG with Only Input Sequences

• Self-derivability (SD) Problem: construct an ARG with only the input sequences

• In fact, such ARG, if exits, must be a minARG

• Runs in O(2n) time

• Heuristics to extend to non-self-derivable data

9

00000

01000

01100

01101

11000

00010

11011

00011 1 2

00000

01000

01100

01101

11000

00010

00011

11011

N1=164

00000

01000

01100

11000

00010

11011

00011

01101

N2=76N = 164*1 + 76*2

= 316

Counting Self-derived ARGs

00000

01000

01100

01101

11000

00010

11011

00011 1 2

00000

01000

01100

01101

11000

00010

00011

11011

164

00000

01000

01100

11000

00010

11011

00011

01101

76

1. Random value Rnd = 0.3 < 0.52

316

Select 11011 with prob = 164/316 = 0.52, and 01101 with prob = 76*2/316 = 0.48

2. Pick seq = 11011 as last row to derive

3. Move to reduced matrix

11

ARGs Represents a Set of Marginal Trees

• Clear separation of cases/controls: NOT expected for complex diseases!

12

Disease Model (Zollner & Pritchard)

Disease mutations: Poisson Process

Two alleles: wild-type and mutant

0.05

0.05

0.05 0.05

0.1

0.1

0.050.05

13

Disease Penetrance (Zollner & Pritchard)

PA,1: probability of a mutant sequence becomes a casePC,1 = 1.0 - PA,1

PA,0: probability of a wild-type sequence becomes a casePC,0 = 1.0 - PA,0

0.05

0.05

0.05 0.05

0.1

0.1

0.050.05

Case

Control

14

Phenotype Likelihood (Zollner and Pritchard)

• Given a tree Tx at position x and case/control phenotype of its leaves, what is the probability Pr( | Tx) of observing on Tx? (Zollner & Pritchard)

– Sum over all subset of mutated edges

• Adopted in this work

15

Expected Phenotype Likelihood

• Need for assessing statistical significance.• Null model: randomly permute case/control

labels.• Our result: O(n3) algorithm for computing

expected value of phenotype likelihood.– Exact, fully deterministic method.

16

Diploid Penetrance

Diploid: two sequences per individual

Diploid enetrance:

PA,00: prob. Individual with two wild-type sequences becomes a case

PA,01 : …, PA,11: …

Case

Control

Efficient computation of phenotype likelihood: stated but unresolved in Zollner and Pritchard

Our result (Wu, 2007): computing phenotype likelihood with diploid penetrance is NP-hard

17

Simulation Results

Comparison: TMARG (uniform), TMARG (pathway), LATAG, MARGARITA

50 ARGs per data

0.1

0.12

0.14

0.16

0.18

0.2

0.22

0.24

Uniform Pathway LATAG MARGARITA

50/5000 ARGs per data

0.1

0.12

0.14

0.16

0.18

0.2

0.22

0.24

n50 n5000 LATAG MARGRITA

18

Acknowledgement

• Software available at: http://wwwcsif.cs.ucdavis.edu/~wuyu

• I want to thank– Dan Gusfield– Dan Brown– Chuck Langley– Yun S. Song