IBM 1 An Algorithm For Exploring Patterns In Clinical Genomic Data Richard Mushlin and Aaron Kershenbaum IBM T.J. Watson Research Center

Post on 21-Dec-2015



IBM 1

An Algorithm For Exploring Patterns In Clinical Genomic Data

Richard Mushlin and

Aaron Kershenbaum

IBM T.J. Watson Research Center


IBM 2

Some Questions

1. What do people with a disease (cases) have in common that people without the disease (controls) don’t?

2. How “exact” is the answer?

3. What does the answer tell us about the disease etiology?


IBM 3

Our Approach

1. Find patterns in cases

2. Compare frequencies with same patterns in controls

3. Score patterns

4. Find relationships between patterns

5. Evaluate effect of patterns on biological pathways


IBM 4

Find Patterns

• Combination of
– Brute force
– Thresholds
– Directed search


IBM 5

Raw Data

           Feature1   Feature2   Feature3   Feature..
Person1    “F1=a”     “F2=a”     “F3=a”     …
Person2    “F1=a”     “F2=a”     “F3=c”     …
Person3    “F1=a”     “F2=b”     “F3=c”     …
Person..   …          …          …          “value”


IBM 6

Bipartite Graph Representation

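[Figure: bipartite graph. People: P1, P2, P3. Feature values: “F1=a”, “F2=a”, “F2=b”, “F3=a”, “F3=c”. An edge joins each person to each feature value that person has.]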


IBM 7

Find Maximal Bicliques

• Biclique:
– Subgraph of bipartite graph
– All people are connected to all feature values

• Maximal biclique:
– Cannot add person to biclique without losing feature value
– Cannot add feature value to biclique without losing people
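A minimal Python sketch of the maximality test described above. It assumes the bipartite graph is stored as a dict that maps each feature value to the set of people connected to it; the function names and the toy data are illustrative and are not taken from the original slides.

```python
# Hypothetical toy graph: feature value -> people connected to it.
adjacency = {
    "S1": {"D1", "D3"},
    "S4": {"D1", "D2"},
    "S5": {"D3"},
    "S7": {"D1"},
    "S9": {"D2", "D3"},
}

def people_sharing(adjacency, features):
    """People connected to every feature value in `features`."""
    sets = [adjacency[f] for f in features]
    return set.intersection(*sets) if sets else set()

def features_shared(adjacency, people):
    """Feature values connected to every person in `people`."""
    return {f for f, ps in adjacency.items() if people and people <= ps}

def is_maximal(adjacency, features, people):
    """True if neither side of the biclique can grow without shrinking the other."""
    features, people = set(features), set(people)
    return (people_sharing(adjacency, features) == people
            and features_shared(adjacency, people) == features)

print(is_maximal(adjacency, {"S1"}, {"D1", "D3"}))    # True
print(is_maximal(adjacency, {"S1", "S4"}, {"D1"}))    # False: S7 can still be added
```

In this toy graph [{S1, S4}, {D1}] is a biclique but not a maximal one, because S7 is also connected to D1.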


IBM 8

Adjacency Representation

Abstract             D1   D2   D3
           Actual    P1   P2   P3
S1         “F1=a”    1    0    1
S2         “F1=b”    0    0    0
S3         “F1=c”    0    1    0
S4         “F2=a”    1    1    0
S5         “F2=b”    0    0    1
S6         “F2=c”    0    0    0
S7         “F3=a”    1    0    0
S8         “F3=b”    0    0    0
S9         “F3=c”    0    1    1
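The adjacency table can be derived mechanically from per-person records. The sketch below is one way to do that in Python; the `raw` records are an assumption chosen to reproduce the 0/1 entries of the table, and `build_adjacency` is a hypothetical helper name.

```python
# Hypothetical per-person records consistent with the adjacency table.
raw = {
    "P1": {"F1": "a", "F2": "a", "F3": "a"},
    "P2": {"F1": "c", "F2": "a", "F3": "c"},
    "P3": {"F1": "a", "F2": "b", "F3": "c"},
}

def build_adjacency(records):
    """Map each observed feature value 'F=v' to the set of people who carry it."""
    adjacency = {}
    for person, features in records.items():
        for feature, value in features.items():
            adjacency.setdefault(f"{feature}={value}", set()).add(person)
    return adjacency

adjacency = build_adjacency(raw)
for feature_value in sorted(adjacency):
    print(feature_value, sorted(adjacency[feature_value]))
```

Feature values that nobody carries (the all-zero rows such as “F1=b”) simply do not appear in the dict.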


IBM 9

Start With Singletons

      D1   D2   D3
S1    1    0    1
S2    0    0    0
S3    0    1    0
S4    1    1    0
S5    0    0    1
S6    0    0    0
S7    1    0    0
S8    0    0    0
S9    0    1    1


IBM 10

Save Acceptable Candidates

• Candidate is acceptable (so far) if:
– Biclique [{S},{D}] is maximal
– {S} has “enough” elements
– {D} has “enough” elements
– Score (figure of merit) is “good enough”

• Thresholds for {S}, {D}, score, set as parameters

• Candidates saved in priority queue by score


IBM 11

Compute Neighbor Set

                 D1   D2   D3
Candidate   S1   1    0    1
Neighbor    S4   1    1    0
Neighbor    S5   0    0    1
Neighbor    S7   1    0    0
Neighbor    S9   0    1    1

• For candidate C = [{Sc},{Dc}], the neighbor set {Nc} is the set of feature values in the original data, other than those already in {Sc}, that are connected to at least one person in {Dc}

• For singleton candidate [{S1},{D1,D3}], the neighbor set is {S4, S5, S7, S9}
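The neighbor-set rule can be sketched in a few lines of Python. The dict below encodes the S/D adjacency table from the earlier slide; the function name `neighbor_set` is an illustrative choice, not something defined in the slides.

```python
# Feature value -> people, taken from the S1..S9 / D1..D3 adjacency table.
adjacency = {
    "S1": {"D1", "D3"}, "S2": set(),        "S3": {"D2"},
    "S4": {"D1", "D2"}, "S5": {"D3"},       "S6": set(),
    "S7": {"D1"},       "S8": set(),        "S9": {"D2", "D3"},
}

def neighbor_set(adjacency, cand_features, cand_people):
    """Feature values outside the candidate that touch at least one of its people."""
    return {
        f for f, people in adjacency.items()
        if f not in cand_features and people & cand_people
    }

print(sorted(neighbor_set(adjacency, {"S1"}, {"D1", "D3"})))
# ['S4', 'S5', 'S7', 'S9']
```

S3 is excluded because its only person, D2, is not in the candidate's people set.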


IBM 12

Expand Candidates

1. Pop “current best” candidate Co off priority queue

2. Create new candidates from neighbors {NCi} by taking unions and intersections:

• SCi = {SCo} ∪ {SNi}
• DCi = {DCo} ∩ {DNi}

3. Save acceptable candidates in queue


IBM 13

Expansion Example

                            P1   P2   P3
Candidate 0       S1        1    0    1
Neighbor i        S4        1    1    0
New Candidate i   S1, S4    1    0    0
Neighbor j        S9        0    1    1
New Candidate j   S1, S9    0    0    1
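The expansion step in this example can be reproduced with a short Python sketch. It covers only the union/intersection step; the maximality and score checks from the "acceptable candidate" slide would still have to be applied to the results. The `min_people` threshold and the function name `expand` are assumptions made for illustration.

```python
# Feature value -> people for the rows used in the example.
adjacency = {
    "S1": {"P1", "P3"},
    "S4": {"P1", "P2"},
    "S9": {"P2", "P3"},
}

def expand(candidate, neighbors, adjacency, min_people=1):
    """Combine a candidate with each neighbor: union of features, intersection of people."""
    features, people = candidate
    new_candidates = []
    for n in sorted(neighbors):
        new_people = people & adjacency[n]
        if len(new_people) >= min_people:      # drop candidates with too few people
            new_candidates.append((features | {n}, new_people))
    return new_candidates

c0 = (frozenset({"S1"}), frozenset({"P1", "P3"}))
for features, people in expand(c0, {"S4", "S9"}, adjacency):
    print(sorted(features), sorted(people))
# ['S1', 'S4'] ['P1']
# ['S1', 'S9'] ['P3']
```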


IBM 14

F.O.M. And The Priority Queue

• Various criteria can be used to calculate a figure of merit (score)

• Working queue size set as parameter

• Working queue is allowed to fill up

• Buffer is emptied when trigger reached

[Figure: priority queue diagram ordered by F.O.M. Labels: Top, Cand 1, Cand 2, Cand n, Bottom, Buffer, Trigger.]
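The slides do not spell out how the buffer and trigger interact, so the following Python sketch shows only one possible reading: candidates accumulate past the working size, and once the total reaches a trigger count the lowest-scoring overflow is discarded. The class name, the parameter names, and this pruning rule are all assumptions.

```python
import heapq

class BoundedCandidateQueue:
    """Keeps the best `working_size` candidates, pruning whenever `trigger` entries pile up."""

    def __init__(self, working_size, trigger):
        self.working_size = working_size
        self.trigger = trigger
        self._items = []                       # list of (score, label) pairs

    def push(self, score, label):
        self._items.append((score, label))
        if len(self._items) >= self.trigger:
            # Assumed rule: keep the top working_size entries, discard the overflow.
            self._items = heapq.nlargest(self.working_size, self._items)

    def pop_best(self):
        best = max(self._items)
        self._items.remove(best)
        return best

    def __len__(self):
        return len(self._items)

queue = BoundedCandidateQueue(working_size=2, trigger=4)
for score, label in [(1.0, "a"), (5.0, "b"), (3.0, "c"), (4.0, "d")]:
    queue.push(score, label)
print(len(queue), queue.pop_best())            # 2 (5.0, 'b')
```

Pruning in batches rather than on every push keeps the per-insert cost low, which may be the motivation for a trigger, though the slides do not say so.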


IBM 15

F.O.M. And Search Strategy

• The search strategy is embodied in the evaluation of the “<” operator for bicliques

• The candidate queue is prioritized in the same order as the scores

• The scoring function can be externalized

• The search strategy can be changed without changing the search machinery
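A small Python sketch of how an externalized scoring function can drive the queue order through the "<" operator. The `Biclique` class, `make_queue`, and the two toy scoring functions are illustrative assumptions; the slides do not show the actual implementation.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Biclique:
    features: frozenset
    people: frozenset
    score: float

    def __lt__(self, other):
        # heapq is a min-heap, so "less than" is defined as "has the higher score".
        return self.score > other.score

def make_queue(candidates, score_fn):
    """Build a priority queue whose order is decided entirely by score_fn."""
    heap = [Biclique(s, d, score_fn(s, d)) for s, d in candidates]
    heapq.heapify(heap)
    return heap

# Two interchangeable (toy) scoring functions.
def by_people(features, people):
    return len(people)

def by_features(features, people):
    return len(features)

candidates = [
    (frozenset({"S1"}), frozenset({"D1", "D3"})),
    (frozenset({"S1", "S4", "S7"}), frozenset({"D1"})),
]
print(sorted(heapq.heappop(make_queue(candidates, by_people)).features))
print(sorted(heapq.heappop(make_queue(candidates, by_features)).features))
```

Swapping `by_people` for `by_features` changes which candidate is expanded first while the queue code stays untouched.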


IBM 16

Scoring Case/Control Problems

            Cases    Controls    Total
Match       a        b           a+b
No match    c        d           c+d
Total       Ncases   Ncontrols   Ntotal

Legend: Given, Measured, Derived


IBM 17

Scoring Example

a        b           a+b
c        d           c+d
Ncases   Ncontrols   Ntotal

Odds Ratio (OR) = ( a * d ) / ( b * c )

FOM = abs ( log ( OR ) )
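A minimal Python version of this score. The slide does not state the base of the logarithm, so the natural log is an assumption here, and the function assumes all four cells are nonzero; the counts in the example are made up.

```python
import math

def odds_ratio_fom(a, b, c, d):
    """abs(log(OR)) for a 2x2 case/control table; assumes a, b, c, d are all > 0."""
    odds_ratio = (a * d) / (b * c)
    return abs(math.log(odds_ratio))

# Hypothetical counts: 20 of 30 cases match the pattern, 5 of 20 controls do.
print(odds_ratio_fom(20, 5, 10, 15))   # log(6)
```

Taking the absolute value means a pattern that is enriched in controls scores as highly as one equally enriched in cases.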


IBM 18

Statistical Significance

a        b           a+b
c        d           c+d
Ncases   Ncontrols   Ntotal

FOMadj = FOM * ( 1 - q )

p = ( (a+b)! (c+d)! (Ncases)! (Ncontrols)! ) / ( a! b! c! d! (Ntotal)! )

q = ∑ p, for all tables with same margins and better FOM
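The adjustment can be sketched in Python by enumerating every table with the same margins. Several choices below are assumptions rather than details from the slides: "better" is read as strictly greater FOM, a table with a zero in exactly one diagonal product is given an infinite FOM, and a table where both products are zero is given FOM 0. The factorial formula is evaluated through the equivalent binomial form.

```python
import math

def fom(a, b, c, d):
    """abs(log(OR)), written as a difference of logs so mirrored tables tie exactly."""
    num, den = a * d, b * c
    if num == 0 and den == 0:
        return 0.0                      # assumed: undefined OR treated as no association
    if num == 0 or den == 0:
        return math.inf                 # assumed: OR of 0 or infinity is the extreme score
    return abs(math.log(num) - math.log(den))

def table_probability(a, b, c, d):
    """Probability p of one table given its margins (same value as the factorial formula)."""
    n_cases, n_controls = a + c, b + d
    return (math.comb(n_cases, a) * math.comb(n_controls, b)
            / math.comb(n_cases + n_controls, a + b))

def adjusted_fom(a, b, c, d):
    """Return (FOMadj, q), where q sums p over same-margin tables with a better FOM."""
    n_cases, n_controls, matches = a + c, b + d, a + b
    observed = fom(a, b, c, d)
    q = 0.0
    for a2 in range(max(0, matches - n_controls), min(matches, n_cases) + 1):
        b2 = matches - a2
        c2, d2 = n_cases - a2, n_controls - b2
        if fom(a2, b2, c2, d2) > observed:
            q += table_probability(a2, b2, c2, d2)
    return observed * (1 - q), q

fom_adj, q = adjusted_fom(3, 1, 1, 3)
print(round(q, 4), round(fom_adj, 4))
```

For the hypothetical table (3, 1, 1, 3) only the two most extreme tables score better, each with probability 1/70, so q = 2/70.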


IBM 19

Structure Of The Output

• Algorithm yields a collection of related patterns (bicliques)

• Question of when to “lump” and when to “split” related patterns

• Lattice structure helps us decide


IBM 20

Lattice (simplified)

• A lattice can be represented as a graph with special properties (Hasse diagram)

• In the context of bicliques, each node B is characterized by 2 sets: S and D

• A directed edge exists from node B1 to B2 if and only if
– S1 is a subset of S2 and
– D2 is a subset of D1


IBM 21

Lattice Example

[Figure: lattice diagram. Nodes are written as S; D and listed here by the size of S.]

Null; 1,2,3,4
A; 1,2,3     B; 2,3,4     C; 1,3,4
A,B; 2,3     A,C; 1,3     B,C; 3,4     C,D; 1,4
A,B,C; 3     A,C,D; 1     B,C,D; 4
A,B,C,D; Null
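The edge rule from the previous slide can be checked against these nodes with a short Python sketch. The subset rule on its own also relates nodes several levels apart, so the sketch additionally drops any pair that has a node in between, which is what a Hasse diagram draws; that reduction step and all function names are assumptions.

```python
def node(features, people):
    """Build a (S, D) pair from compact strings, e.g. node("AB", "23")."""
    return frozenset(features), frozenset(people)

nodes = {
    "Null": node("", "1234"),
    "A": node("A", "123"), "B": node("B", "234"), "C": node("C", "134"),
    "AB": node("AB", "23"), "AC": node("AC", "13"),
    "BC": node("BC", "34"), "CD": node("CD", "14"),
    "ABC": node("ABC", "3"), "ACD": node("ACD", "1"), "BCD": node("BCD", "4"),
    "ABCD": node("ABCD", ""),
}

def precedes(b1, b2):
    """Slide rule: S1 is a subset of S2 and D2 is a subset of D1 (distinct nodes)."""
    (s1, d1), (s2, d2) = b1, b2
    return b1 != b2 and s1 <= s2 and d2 <= d1

def hasse_edges(nodes):
    """Keep only pairs with no third node strictly between them."""
    edges = []
    for n1, b1 in nodes.items():
        for n2, b2 in nodes.items():
            if precedes(b1, b2) and not any(
                precedes(b1, b3) and precedes(b3, b2) for b3 in nodes.values()
            ):
                edges.append((n1, n2))
    return edges

edges = hasse_edges(nodes)
print(len(edges))
print([child for parent, child in edges if parent == "C"])
```

Note that there is no node for D alone: every person who has D also has C, so the maximal biclique containing D is C,D; 1,4.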


IBM 22

Score jump Lattice


IBM 23

Real SNP Data (a few rows)

11581 AG AA CG AA AA CC AA GG AA AG 67

11643 GG AG CG CA AA CT AA GG AA GG 67

11644 GG AG CG AA AA CT AA GG AA GG 67

11647 GG AG CC AA AA CC AA GG AA GG 66

11657 GG AA CG AA AA CC AA GG AA AG 67

11660 GG AG CG CC AA TT AA Un AA AG 67

11664 GG AG CC Un AA CC AA GA AA AG 67

11665 GG AA CG CA AG CT AA GG AA AG 66

11666 GG AG CG AA AA CT AA GG AA GG 66

11668 GG AA CG AA AA TT AA GG AA GG 66

11669 GG AG CG CA AA CT AA GA AA GG 66


IBM 24

SNP Lattice


IBM 25

“Reading” The Lattice

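[Figure: annotated lattice. Labels recovered from the slide:]

• Background context: small Ns, large Nd, low FOM
• Edges: +S -D, +S’ -D’, +S” -D” (effect of adding S in the context of a node)
• Node labels: High FOM, Higher FOM, Lower FOM
• Edge annotations: Score jump, Scores similar
• Children of a node may have better or worse scores, but are “similar” to it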