
IBM 1

An Algorithm For Exploring Patterns In

Clinical Genomic Data

Richard Mushlin and

Aaron Kershenbaum

IBM T.J. Watson Research Center

IBM 2

Some Questions

1. What do people with a disease (cases) have in common that people without the disease (controls) don’t?

2. How “exact” is the answer?

3. What does the answer tell us about the disease etiology?

IBM 3

Our Approach

1. Find patterns in cases

2. Compare frequencies with same patterns in controls

3. Score patterns

4. Find relationships between patterns

5. Evaluate effect of patterns on biological pathways

IBM 4

Find Patterns

• Combination of
– Brute force
– Thresholds
– Directed search

IBM 5

Raw Data

Feature1 Feature2 Feature3 Feature..

Person1 “F1=a” “F2=a” “F3=a” …

Person2 “F1=a” “F2=a” “F3=c” …

Person3 “F1=a” “F2=b” “F3=c” …

Person.. … … … “value”

IBM 6

Bipartite Graph Representation

[Figure: bipartite graph. One side holds the people (P1, P2, P3); the other holds the feature values (“F1=a”, “F2=a”, “F2=b”, “F3=a”, “F3=c”).]
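As a rough illustration, the step from the raw table to the bipartite graph can be sketched as below. The rows are the three people shown on the Raw Data slide; the function name `to_bipartite` and the dictionary layout are assumptions made for this sketch, not part of the original talk.

```python
from collections import defaultdict

# Rows copied from the Raw Data slide (only the three features shown there).
raw = {
    "P1": {"F1": "a", "F2": "a", "F3": "a"},
    "P2": {"F1": "a", "F2": "a", "F3": "c"},
    "P3": {"F1": "a", "F2": "b", "F3": "c"},
}

def to_bipartite(rows):
    """Return (person -> feature values, feature value -> people).

    Hypothetical helper: each "Feature=value" string becomes one vertex
    on the feature-value side of the graph.
    """
    by_person = {}
    by_value = defaultdict(set)
    for person, features in rows.items():
        values = {f"{name}={val}" for name, val in features.items()}
        by_person[person] = values
        for v in values:
            by_value[v].add(person)  # one edge per (person, value) pair
    return by_person, dict(by_value)

by_person, by_value = to_bipartite(raw)
```

Keeping both directions of the mapping makes it cheap to ask either "which values does this person have" or "which people share this value".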

IBM 7

Find Maximal Bicliques

• Biclique:
– Subgraph of bipartite graph
– All people are connected to all feature values

• Maximal biclique:
– Cannot add person to biclique without losing feature value
– Cannot add feature value to biclique without losing people
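The two conditions above can be checked directly with set operations. The sketch below assumes the graph is stored as a map from feature value to the set of people who have it, filled with the 0/1 values from the Adjacency Representation slide; the names `ADJ`, `is_biclique` and `is_maximal` are hypothetical.

```python
# Feature value -> set of people (values from the Adjacency Representation slide).
ADJ = {
    "S1": {"D1", "D3"}, "S2": set(), "S3": {"D2"},
    "S4": {"D1", "D2"}, "S5": {"D3"}, "S6": set(),
    "S7": {"D1"}, "S8": set(), "S9": {"D2", "D3"},
}
ALL_PEOPLE = {"D1", "D2", "D3"}

def is_biclique(S, D, adj=ADJ):
    """Every person in D is connected to every feature value in S."""
    return all(D <= adj[s] for s in S)

def is_maximal(S, D, adj=ADJ, people=ALL_PEOPLE):
    """True if [S, D] is a biclique that cannot grow on either side."""
    if not is_biclique(S, D, adj):
        return False
    # A person outside D who has every value in S could be added for free.
    extra_person = any(all(p in adj[s] for s in S) for p in people - D)
    # A value outside S shared by everyone in D could be added for free.
    extra_value = any(D <= adj[s] for s in adj.keys() - S)
    return not (extra_person or extra_value)
```

For example, `is_maximal({"S1"}, {"D1", "D3"})` holds, while `is_maximal({"S1"}, {"D1"})` does not, because D3 also has S1.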

IBM 8

Adjacency Representation

Abstract            D1   D2   D3
          Actual    P1   P2   P3
S1        “F1=a”    1    0    1
S2        “F1=b”    0    0    0
S3        “F1=c”    0    1    0
S4        “F2=a”    1    1    0
S5        “F2=b”    0    0    1
S6        “F2=c”    0    0    0
S7        “F3=a”    1    0    0
S8        “F3=b”    0    0    0
S9        “F3=c”    0    1    1
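One compact way to hold such a 0/1 table is one integer bitmask per feature value, so that "people shared by several values" becomes a bitwise AND. This is only a sketch of that idea; the slides do not say how the rows are stored, and `MASKS`, `people_in` and `shared_people` are names invented here.

```python
PEOPLE = ["D1", "D2", "D3"]

# Rows of the adjacency table, written as bit strings in column order D1 D2 D3.
ROWS = {
    "S1": "101", "S2": "000", "S3": "010",
    "S4": "110", "S5": "001", "S6": "000",
    "S7": "100", "S8": "000", "S9": "011",
}
MASKS = {s: int(bits, 2) for s, bits in ROWS.items()}

def people_in(mask):
    """Decode a bitmask back into the list of people it covers."""
    n = len(PEOPLE)
    return [p for i, p in enumerate(PEOPLE) if mask & (1 << (n - 1 - i))]

def shared_people(values):
    """Bitmask of the people who have every feature value in `values`."""
    mask = (1 << len(PEOPLE)) - 1  # start with everyone
    for s in values:
        mask &= MASKS[s]
    return mask
```

With this layout, `people_in(shared_people(["S1", "S4"]))` gives the people common to S1 and S4.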

IBM 9

Start With Singletons

D1 D2 D3

S1 1 0 1

S2 0 0 0

S3 0 1 0

S4 1 1 0

S5 0 0 1

S6 0 0 0

S7 1 0 0

S8 0 0 0

S9 0 1 1

IBM 10

Save Acceptable Candidates

• Candidate is acceptable (so far) if:
– Biclique [{S},{D}] is maximal
– {S} has “enough” elements
– {D} has “enough” elements
– Score (figure of merit) is “good enough”

• Thresholds for {S}, {D}, score, set as parameters

• Candidates saved in priority queue by score
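A minimal sketch of this filter-then-queue step, using Python's `heapq`. The threshold names and values (`MIN_S`, `MIN_D`, `MIN_SCORE`) are placeholders chosen for illustration, and the maximality test is passed in as a flag rather than computed here.

```python
import heapq
import itertools

# Hypothetical thresholds; the slides only say these are parameters.
MIN_S, MIN_D, MIN_SCORE = 1, 1, 0.0

_tiebreak = itertools.count()  # keeps heap entries comparable on equal scores

def acceptable(S, D, score, maximal):
    """Apply the four acceptance conditions from the slide."""
    return (maximal and len(S) >= MIN_S and len(D) >= MIN_D
            and score >= MIN_SCORE)

def save(queue, S, D, score, maximal=True):
    """Push the candidate if acceptable; return whether it was saved."""
    if not acceptable(S, D, score, maximal):
        return False
    # heapq is a min-heap, so the score is negated to pop the best first.
    heapq.heappush(queue, (-score, next(_tiebreak), frozenset(S), frozenset(D)))
    return True

def pop_best(queue):
    """Return (score, S, D) for the highest-scoring saved candidate."""
    neg_score, _, S, D = heapq.heappop(queue)
    return -neg_score, S, D
```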

IBM 11

Compute Neighbor Set

                D1  D2  D3
Candidate  S1   1   0   1
Neighbor   S4   1   1   0
Neighbor   S5   0   0   1
Neighbor   S7   1   0   0
Neighbor   S9   0   1   1

• For candidate C = [{Sc},{Dc}], the neighbor set {Nc} is the set of feature values in the original data, other than those already in {Sc}, that are each connected to at least one person in {Dc}

• For singleton candidate [{S1},{D1,D3}], the neighbor set is {S4, S5, S7, S9}
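The neighbor set for the singleton example can be reproduced with a one-line set comprehension. The sketch assumes the same feature-value-to-people map as the adjacency table; `neighbor_set` is a hypothetical name, and excluding the candidate's own feature values is read off the example rather than stated on the slide.

```python
# Feature value -> set of people (values from the adjacency table).
ADJ = {
    "S1": {"D1", "D3"}, "S2": set(), "S3": {"D2"},
    "S4": {"D1", "D2"}, "S5": {"D3"}, "S6": set(),
    "S7": {"D1"}, "S8": set(), "S9": {"D2", "D3"},
}

def neighbor_set(S, D, adj=ADJ):
    """Feature values outside S that share at least one person with D."""
    return {s for s, people in adj.items() if s not in S and people & D}

neighbors = neighbor_set({"S1"}, {"D1", "D3"})
```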

IBM 12

Expand Candidates

1. Pop “current best” candidate Co off priority queue

2. Create new candidates from neighbors {NCi} by taking unions and intersections:

• SCi = {SCo} ∪ {SNi}
• DCi = {DCo} ∩ {DNi}

3. Save acceptable candidates in queue

IBM 13

Expansion Example

P1 P2 P3

Candidate 0 S1 1 0 1

Neighbor i S4 1 1 0

New Candidate i S1, S4 1 0 0

Neighbor j S9 0 1 1

New Candidate j S1, S9 0 0 1
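The expansion example can be replayed in a few lines: each new candidate takes the union of the feature values and the intersection of the people. This is a sketch only; `expand` is a hypothetical name, and it does not apply the acceptance tests (maximality, thresholds, score) that decide whether a new candidate is saved.

```python
# Feature value -> set of people, for the rows used in the example.
ADJ = {
    "S1": {"P1", "P3"},
    "S4": {"P1", "P2"},
    "S9": {"P2", "P3"},
}

def expand(candidate, neighbors, adj=ADJ):
    """Create one new candidate per neighbor: S grows, D shrinks."""
    S0, D0 = candidate
    new_candidates = []
    for n in sorted(neighbors):
        S = S0 | {n}        # union of feature values
        D = D0 & adj[n]     # intersection of people
        if D:               # guard: skip candidates with no people left
            new_candidates.append((S, D))
    return new_candidates

result = expand(({"S1"}, {"P1", "P3"}), {"S4", "S9"})
```

Here `result` holds the two new candidates from the slide: S1 with S4 keeps only P1, and S1 with S9 keeps only P3.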

IBM 14

F.O.M. And The Priority Queue

• Various criteria can be used to calculate a figure of merit (score)

• Working queue size set as parameter

• Working queue is allowed to fill up

• Buffer is emptied when trigger reached

[Figure: priority queue ordered by F.O.M., labeled from Top (Cand 1, Cand 2, …) to Bottom, followed by a Buffer (Cand n, …) that extends to the Trigger.]
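The slides do not spell out what "emptying the buffer" does. One plausible reading, sketched below purely as an assumption, is that the queue may grow past its working size until it reaches a trigger length, at which point everything below the working size is discarded. The class name `TrimmedQueue` and its parameters are hypothetical.

```python
class TrimmedQueue:
    """Score-ordered queue with a working size and an overflow buffer.

    Assumed behavior (not confirmed by the slides): items accumulate until
    the total count reaches `trigger`; then only the best `size` are kept.
    """

    def __init__(self, size, trigger):
        if trigger <= size:
            raise ValueError("trigger must exceed the working size")
        self.size = size
        self.trigger = trigger
        self.items = []  # list of (score, label)

    def push(self, score, label):
        self.items.append((score, label))
        if len(self.items) >= self.trigger:
            # Empty the buffer: keep the top `size` candidates by score.
            self.items.sort(key=lambda item: item[0], reverse=True)
            del self.items[self.size:]

    def pop_best(self):
        best = max(self.items, key=lambda item: item[0])
        self.items.remove(best)
        return best

    def __len__(self):
        return len(self.items)
```

Trimming in batches rather than on every push avoids re-sorting the queue each time a low-scoring candidate arrives.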

IBM 15

F.O.M. And Search Strategy

• The search strategy is embodied in the evaluation of the “<” operator for bicliques

• The candidate queue is prioritized in the same order as the scores

• The scoring function can be externalized

• The search strategy can be changed without changing the search machinery
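In Python the same idea can be sketched by giving the biclique class a `__lt__` method that defers to a pluggable scoring function, so that `heapq` orders candidates by whatever strategy is installed. The class, the `use` method and the two example scorers are illustrative assumptions, not the authors' implementation.

```python
import heapq

class Biclique:
    """Candidate whose ordering is delegated to an external scoring function."""

    scorer = None  # installed with Biclique.use(...)

    def __init__(self, S, D):
        self.S, self.D = frozenset(S), frozenset(D)

    @classmethod
    def use(cls, fn):
        """Swap the search strategy without touching the queue code."""
        cls.scorer = staticmethod(fn)

    def __lt__(self, other):
        # heapq pops the "smallest" item, so a higher score must sort first.
        return type(self).scorer(self) > type(self).scorer(other)

def best(candidates):
    """Return the candidate that the current strategy ranks first."""
    heap = list(candidates)
    heapq.heapify(heap)
    return heapq.heappop(heap)

a = Biclique({"S1"}, {"D1", "D3"})
b = Biclique({"S1", "S4"}, {"D1"})

Biclique.use(lambda c: len(c.D))   # hypothetical strategy: prefer more people
first_by_people = best([a, b])

Biclique.use(lambda c: len(c.S))   # hypothetical strategy: prefer more features
first_by_features = best([a, b])
```

The queue machinery (`best`) is identical in both runs; only the installed scorer changes which candidate comes out first.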

IBM 16

Scoring Case/Control Problems

            Cases     Controls     Total
Match       a         b            a+b
No match    c         d            c+d
Total       Ncases    Ncontrols    Ntotal

Given Measured Derived

IBM 17

Scoring Example

a b a+b

c d c+d

Ncases Ncontrols Ntotal

Odds Ratio (OR) = ( a * d ) / ( b * c )

FOM = abs ( log ( OR ) )
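A direct transcription of the two formulas, as a small sketch. The counts in the usage line are made-up numbers for illustration only, and the function assumes all four cells are nonzero; the slides do not say how zero cells are handled.

```python
import math

def fom(a, b, c, d):
    """Figure of merit: absolute log of the odds ratio of a 2x2 table.

    a, b = cases / controls that match the pattern
    c, d = cases / controls that do not match
    Assumes every cell is nonzero.
    """
    odds_ratio = (a * d) / (b * c)
    return abs(math.log(odds_ratio))

# Illustrative counts only: 20 of 30 cases match, 5 of 20 controls match.
example = fom(20, 5, 10, 15)
```

Taking the absolute value means a pattern that is strongly depleted in cases scores as highly as one that is strongly enriched.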

IBM 18

Statistical Significance

a b a+b

c d c+d

Ncases Ncontrols Ntotal

FOMadj = FOM * ( 1 - q )

p = ( (a+b)! (c+d)! (Ncases)! (Ncontrols)! ) / ( a! b! c! d! (Ntotal)! )

q = ∑ p, for all tables with same margins and better FOM
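The adjustment can be sketched by enumerating every 2x2 table with the same margins, computing p for each, and summing p over the tables whose FOM is strictly larger than the observed one. Two points below are assumptions not stated on the slide: "better" is taken to mean strictly larger FOM, and a table with a zero cell is given an infinite FOM (or zero when the ratio is 0/0).

```python
import math
from math import factorial as f

def fom(a, b, c, d):
    """abs(log(OR)), written as a log difference so it is exactly symmetric."""
    num, den = a * d, b * c
    if num == 0 and den == 0:
        return 0.0        # assumption: undefined ratio treated as no signal
    if num == 0 or den == 0:
        return math.inf   # assumption: zero cell treated as extreme
    return abs(math.log(num) - math.log(den))

def table_p(a, b, c, d):
    """Probability of this table given its margins (formula from the slide)."""
    n = a + b + c + d
    return (f(a + b) * f(c + d) * f(a + c) * f(b + d)) / (
        f(a) * f(b) * f(c) * f(d) * f(n))

def fom_adjusted(a, b, c, d):
    """Return (FOMadj, q) for the observed table."""
    observed = fom(a, b, c, d)
    matches, n_cases, n_controls = a + b, a + c, b + d
    q = 0.0
    # x plays the role of cell a; the other cells follow from the margins.
    for x in range(max(0, matches - n_controls), min(matches, n_cases) + 1):
        t = (x, matches - x, n_cases - x, n_controls - (matches - x))
        if fom(*t) > observed:
            q += table_p(*t)
    return observed * (1 - q), q

adjusted, q = fom_adjusted(3, 1, 1, 3)
```

The factorial form mirrors the slide but can overflow a float for large counts; working with logarithms of factorials would be one way around that.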

IBM 19

Structure Of The Output

• Algorithm yields a collection of related patterns (bicliques)

• Question of when to “lump” and when to “split” related patterns

• Lattice structure helps us decide

IBM 20

Lattice (simplified)

• A lattice can be represented as a graph with special properties (Hasse diagram)

• In the context of bicliques, each node B is characterized by 2 sets: S and D

• A directed edge exists from node B1 to B2 if and only if– S1 is a subset of S2 and– D2 is a subset of D1

IBM 21

Lattice Example

Null; 1,2,3,4

A; 1,2,3    B; 2,3,4    C; 1,3,4

A,B; 2,3    A,C; 1,3    B,C; 3,4    C,D; 1,4

A,B,C; 3    A,C,D; 1    B,C,D; 4

A,B,C,D; Null
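The nodes of this example can be regenerated by brute force, which is feasible only because it has four feature values. The sketch below assumes the underlying data are A = {1,2,3}, B = {2,3,4}, C = {1,3,4} and D = {1,4}; the first three are read from the single-feature nodes, while D is inferred from the nodes "C,D; 1,4", "A,C,D; 1" and "B,C,D; 4". The function names are hypothetical, and `has_edge` applies the subset rule from the previous slide between distinct nodes.

```python
from itertools import combinations

# Feature value -> people (D is inferred from the nodes, see lead-in).
PEOPLE_OF = {
    "A": {1, 2, 3}, "B": {2, 3, 4}, "C": {1, 3, 4}, "D": {1, 4},
}
ALL_PEOPLE = {1, 2, 3, 4}

def close(S):
    """Grow a set of feature values into a maximal biclique (S, D)."""
    D = set(ALL_PEOPLE)
    for s in S:
        D &= PEOPLE_OF[s]
    # Add every feature value that all remaining people share.
    S_closed = {s for s, people in PEOPLE_OF.items() if D <= people}
    return frozenset(S_closed), frozenset(D)

def lattice_nodes():
    """Brute force: close every subset of feature values, keep the distinct results."""
    nodes = set()
    for k in range(len(PEOPLE_OF) + 1):
        for S in combinations(sorted(PEOPLE_OF), k):
            nodes.add(close(S))
    return nodes

def has_edge(b1, b2):
    """Subset rule from the slide, applied to two distinct nodes."""
    (S1, D1), (S2, D2) = b1, b2
    return b1 != b2 and S1 <= S2 and D2 <= D1

nodes = lattice_nodes()
```

Under these assumptions the enumeration yields the same twelve nodes as the slide. Note that the subset rule links every comparable pair of nodes; a drawn Hasse diagram normally shows only the links between immediate neighbors.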

IBM 22

Lattice

[Figure: lattice diagram, annotated “Score jump”.]

IBM 23

Real SNP Data (a few rows)

11581 AG AA CG AA AA CC AA GG AA AG 67

11643 GG AG CG CA AA CT AA GG AA GG 67

11644 GG AG CG AA AA CT AA GG AA GG 67

11647 GG AG CC AA AA CC AA GG AA GG 66

11657 GG AA CG AA AA CC AA GG AA AG 67

11660 GG AG CG CC AA TT AA Un AA AG 67

11664 GG AG CC Un AA CC AA GA AA AG 67

11665 GG AA CG CA AG CT AA GG AA AG 66

11666 GG AG CG AA AA CT AA GG AA GG 66

11668 GG AA CG AA AA TT AA GG AA GG 66

11669 GG AG CG CA AA CT AA GA AA GG 66

IBM 24

SNP Lattice

IBM 25

“Reading” The Lattice

Small Ns, Large Nd, Low FOM

+S -D

+S” -D”

High FOM

Higher FOM

Background context

Effect of adding S in the context of

Lower FOM

+S’ -D’

Children of may have better or worse scores, but are “similar” to

Score jump

Scores similar
