Post on 21-Dec-2015
IBM 1
An Algorithm For Exploring Patterns In
Clinical Genomic Data
Richard Mushlin and
Aaron Kershenbaum
IBM T.J. Watson Research Center
IBM 2
Some Questions
1. What do people with a disease (cases) have in common that people without the disease (controls) don’t?
2. How “exact” is the answer?
3. What does the answer tell us about the disease etiology?
IBM 3
Our Approach
1. Find patterns in cases
2. Compare frequencies with same patterns in controls
3. Score patterns
4. Find relationships between patterns
5. Evaluate effect of patterns on biological pathways
IBM 4
Find Patterns
• Combination of
  – Brute force
  – Thresholds
  – Directed search
IBM 5
Raw Data
          Feature1  Feature2  Feature3  Feature..
Person1   “F1=a”    “F2=a”    “F3=a”    …
Person2   “F1=a”    “F2=a”    “F3=c”    …
Person3   “F1=a”    “F2=b”    “F3=c”    …
Person..  …         …         …         “value”
IBM 6
Bipartite Graph Representation
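[Figure: bipartite graph. An edge joins each person to each of that person's feature values.]
People: P1, P2, P3
Feature values: “F1=a”, “F2=a”, “F2=b”, “F3=a”, “F3=c”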
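The construction of the graph from the raw table can be sketched as follows. This is a minimal illustration, not the authors' code: the names `raw_data` and `build_bipartite` are hypothetical, and the three rows are the ones shown on the Raw Data slide.

```python
# Rows copied from the Raw Data slide; all names here are illustrative.
raw_data = {
    "P1": {"F1": "a", "F2": "a", "F3": "a"},
    "P2": {"F1": "a", "F2": "a", "F3": "c"},
    "P3": {"F1": "a", "F2": "b", "F3": "c"},
}


def build_bipartite(rows):
    """Return both sides of the graph as adjacency maps.

    person_to_values: person -> set of "feature=value" nodes
    value_to_people:  "feature=value" node -> set of people
    """
    person_to_values = {}
    value_to_people = {}
    for person, features in rows.items():
        for name, value in features.items():
            node = f"{name}={value}"
            person_to_values.setdefault(person, set()).add(node)
            value_to_people.setdefault(node, set()).add(person)
    return person_to_values, value_to_people


person_to_values, value_to_people = build_bipartite(raw_data)
```

Each distinct feature value becomes one node on the right-hand side, so a feature with k observed values contributes k nodes.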
IBM 7
Find Maximal Bicliques
• Biclique:
  – Subgraph of bipartite graph
  – All people are connected to all feature values
• Maximal biclique:
  – Cannot add person to biclique without losing feature value
  – Cannot add feature value to biclique without losing people
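One way to test maximality is a closure check: a biclique [{S},{D}] is maximal when {D} is exactly the set of people shared by every feature value in {S}, and {S} is exactly the set of feature values shared by every person in {D}. The sketch below assumes that formulation; the helper names are hypothetical, and the data is the small S/D example from the adjacency slide (feature values with no people omitted).

```python
# Feature value -> people, taken from the adjacency slide (illustrative names).
ADJ = {
    "S1": {"D1", "D3"},
    "S3": {"D2"},
    "S4": {"D1", "D2"},
    "S5": {"D3"},
    "S7": {"D1"},
    "S9": {"D2", "D3"},
}
ALL_PEOPLE = {"D1", "D2", "D3"}


def common_people(S, adj, all_people):
    """People connected to every feature value in S."""
    people = set(all_people)
    for s in S:
        people &= adj[s]
    return people


def common_values(D, adj):
    """Feature values connected to every person in D."""
    return {s for s, people in adj.items() if D <= people}


def is_maximal_biclique(S, D, adj, all_people):
    """True if neither a person nor a feature value can be added for free."""
    return common_people(S, adj, all_people) == D and common_values(D, adj) == S
```

For example, [{S1},{D1,D3}] passes the check, while [{S1,S4},{D1}] does not, because S7 is also shared by D1 and could be added without losing anyone.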
IBM 8
Adjacency Representation
Abstract  Actual   D1 (P1)  D2 (P2)  D3 (P3)
S1        “F1=a”   1        0        1
S2        “F1=b”   0        0        0
S3        “F1=c”   0        1        0
S4        “F2=a”   1        1        0
S5        “F2=b”   0        0        1
S6        “F2=c”   0        0        0
S7        “F3=a”   1        0        0
S8        “F3=b”   0        0        0
S9        “F3=c”   0        1        1
IBM 9
Start With Singletons
D1 D2 D3
S1 1 0 1
S2 0 0 0
S3 0 1 0
S4 1 1 0
S5 0 0 1
S6 0 0 0
S7 1 0 0
S8 0 0 0
S9 0 1 1
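A minimal sketch of the singleton step, assuming each row of the adjacency table becomes one starting candidate [{S},{D}]. Skipping rows with no people is an assumption made here, not something the slides state, and the names `MATRIX` and `singletons` are hypothetical.

```python
PEOPLE = ["D1", "D2", "D3"]

# Rows copied from the adjacency table above.
MATRIX = {
    "S1": [1, 0, 1],
    "S2": [0, 0, 0],
    "S3": [0, 1, 0],
    "S4": [1, 1, 0],
    "S5": [0, 0, 1],
    "S6": [0, 0, 0],
    "S7": [1, 0, 0],
    "S8": [0, 0, 0],
    "S9": [0, 1, 1],
}


def singletons(matrix, people):
    """Turn each non-empty row into a candidate ({S}, {D})."""
    candidates = []
    for s, row in matrix.items():
        d = frozenset(p for p, bit in zip(people, row) if bit)
        if d:  # assumption: a feature value nobody has is not a useful start
            candidates.append((frozenset([s]), d))
    return candidates


start = singletons(MATRIX, PEOPLE)
```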
IBM 10
Save Acceptable Candidates
• Candidate is acceptable (so far) if:
  – Biclique [{S},{D}] is maximal
  – {S} has “enough” elements
  – {D} has “enough” elements
  – Score (figure of merit) is “good enough”
• Thresholds for {S}, {D}, score, set as parameters
• Candidates saved in priority queue by score
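The acceptance test and the score-ordered queue might look like the sketch below. It is an illustration under stated assumptions: the parameter names `min_s`, `min_d`, and `min_score` are hypothetical, maximality is passed in as a precomputed flag, and Python's `heapq` (a min-heap) is used with negated scores so that the best candidate pops first.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so equal scores never compare sets


def acceptable(S, D, score, is_maximal, min_s=1, min_d=1, min_score=0.0):
    """Apply the four criteria from the slide; thresholds are parameters."""
    return (
        is_maximal
        and len(S) >= min_s
        and len(D) >= min_d
        and score >= min_score
    )


def save(queue, S, D, score):
    """Push a candidate; the score is negated because heapq pops the smallest."""
    heapq.heappush(queue, (-score, next(_counter), frozenset(S), frozenset(D)))


def pop_best(queue):
    """Return (S, D, score) for the highest-scoring candidate."""
    neg_score, _, S, D = heapq.heappop(queue)
    return S, D, -neg_score
```

Keeping the thresholds as arguments mirrors the slide's point that they are set as parameters rather than fixed in the search.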
IBM 11
Compute Neighbor Set
              D1 D2 D3
Candidate S1  1  0  1
Neighbor  S4  1  1  0
Neighbor  S5  0  0  1
Neighbor  S7  1  0  0
Neighbor  S9  0  1  1
• For candidate C = [{Sc},{Dc}], the neighbor set {Nc} is the set of feature values in the original data, other than those already in {Sc}, that are connected to at least one person in {Dc}
• For singleton candidate [{S1},{D1,D3}], the neighbor set is {S4, S5, S7, S9}
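The neighbor-set definition translates almost directly into a set comprehension. The sketch below is illustrative (the names `ADJ` and `neighbor_set` are hypothetical) and uses the nine feature values from the adjacency table.

```python
# Feature value -> people, from the adjacency table (empty rows included).
ADJ = {
    "S1": {"D1", "D3"}, "S2": set(), "S3": {"D2"},
    "S4": {"D1", "D2"}, "S5": {"D3"}, "S6": set(),
    "S7": {"D1"}, "S8": set(), "S9": {"D2", "D3"},
}


def neighbor_set(S_c, D_c, adj):
    """Feature values outside S_c that share at least one person with D_c."""
    return {s for s, people in adj.items() if s not in S_c and people & D_c}


neighbors = neighbor_set({"S1"}, {"D1", "D3"}, ADJ)
```

S3 is excluded because its only person, D2, is not in {D1, D3}.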
IBM 12
Expand Candidates
1. Pop “current best” candidate Co off priority queue
2. Create new candidates from neighbors {NCi} by taking unions and intersections:
• SCi = {SCo} ∪ {SNi}
• DCi = {DCo} ∩ {DNi}
3. Save acceptable candidates in queue
IBM 13
Expansion Example
P1 P2 P3
Candidate 0 S1 1 0 1
Neighbor i S4 1 1 0
New Candidate i S1, S4 1 0 0
Neighbor j S9 0 1 1
New Candidate j S1, S9 0 0 1
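The expansion step can be sketched as a union on the feature side and an intersection on the people side, reproducing the two new candidates in the table above. The function name `expand` is hypothetical, and the acceptance test (including maximality) is assumed to run afterwards on each result.

```python
# Feature value -> people for the rows used in the example.
ADJ = {
    "S1": {"D1", "D3"},
    "S4": {"D1", "D2"},
    "S9": {"D2", "D3"},
}


def expand(S_o, D_o, neighbors, adj):
    """Create one new candidate per neighbor: S grows by union, D shrinks by intersection."""
    new_candidates = []
    for n in sorted(neighbors):
        S_i = frozenset(S_o) | {n}
        D_i = frozenset(D_o) & adj[n]
        if D_i:  # always true for genuine neighbors; kept as a safety check
            new_candidates.append((S_i, D_i))
    return new_candidates


result = expand({"S1"}, {"D1", "D3"}, {"S4", "S9"}, ADJ)
```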
IBM 14
F.O.M. And The Priority Queue
• Various criteria can be used to calculate a figure of merit (score)
• Working queue size set as parameter
• Working queue is allowed to fill up
• Buffer is emptied when trigger reached
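[Figure: candidate queue ordered by F.O.M. The working queue runs from Top (Cand 1, Cand 2, …) to Bottom; the Buffer below it (…, Cand n) extends to the Trigger.]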
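The slides do not spell out the queue mechanics, so the following is only one plausible reading: candidates accumulate past the working size into a buffer, and once the total length reaches the trigger, everything below the best `working_size` entries is discarded. The class and attribute names are hypothetical.

```python
import heapq


class BoundedCandidateQueue:
    """Score-ordered store that trims itself when a trigger length is reached."""

    def __init__(self, working_size, trigger):
        self.working_size = working_size  # entries kept after a trim
        self.trigger = trigger            # total length that forces a trim
        self.items = []                   # list of (score, candidate)

    def push(self, score, candidate):
        self.items.append((score, candidate))
        if len(self.items) >= self.trigger:
            # Empty the buffer: keep only the best working_size entries.
            self.items = heapq.nlargest(
                self.working_size, self.items, key=lambda item: item[0]
            )

    def pop_best(self):
        """Remove and return the (score, candidate) pair with the highest score."""
        best = max(range(len(self.items)), key=lambda i: self.items[i][0])
        return self.items.pop(best)
```

Trimming in batches rather than on every push keeps the common case cheap, at the cost of temporarily holding more candidates than the working size.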
IBM 15
F.O.M. And Search Strategy
• The search strategy is embodied in the evaluation of the “<” operator for bicliques
• The candidate queue is prioritized in the same order as the scores
• The scoring function can be externalized
• The search strategy can be changed without changing the search machinery
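In Python, the same idea could be expressed by defining `__lt__` on a biclique class and delegating to a scoring function supplied from outside. This is a sketch of the design, not the original implementation; `Biclique`, `by_people`, and `by_features` are hypothetical names.

```python
class Biclique:
    """A candidate whose ordering is decided by an external scoring function."""

    def __init__(self, S, D, score_fn):
        self.S = frozenset(S)
        self.D = frozenset(D)
        self.score_fn = score_fn

    def score(self):
        return self.score_fn(self.S, self.D)

    def __lt__(self, other):
        # The whole search strategy lives in this comparison.
        return self.score() < other.score()


def by_people(S, D):
    """Prefer patterns shared by many people."""
    return len(D)


def by_features(S, D):
    """Prefer patterns with many feature values."""
    return len(S)


def best(candidates):
    """Highest-ranked candidate under whatever scoring function they carry."""
    return sorted(candidates)[-1]
```

Swapping `by_people` for `by_features` changes which candidate ranks first without touching the sorting or queueing code.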
IBM 16
Scoring Case/Control Problems
Cases Controls Total
Match a b a+b
No match c d c+d
Total Ncases Ncontrols Ntotal
Given Measured Derived
IBM 17
Scoring Example
a b a+b
c d c+d
Ncases Ncontrols Ntotal
Odds Ratio (OR) = ( a * d ) / ( b * c )
FOM = abs ( log ( OR ) )
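A small worked sketch of this score, with c and d derived from the group totals as in the case/control table. The function name `fom` is hypothetical, and raising an error on a zero cell is an assumption made here; the slides do not say how zero cells are handled.

```python
import math


def fom(a, b, n_cases, n_controls):
    """abs(log(odds ratio)) for a pattern matching a cases and b controls."""
    c = n_cases - a      # cases that do not match
    d = n_controls - b   # controls that do not match
    if a * d == 0 or b * c == 0:
        # Assumption: zero cells are rejected rather than scored.
        raise ValueError("odds ratio undefined or infinite for a zero cell")
    odds_ratio = (a * d) / (b * c)
    return abs(math.log(odds_ratio))


# 30 of 50 cases and 10 of 50 controls match: OR = (30 * 40) / (10 * 20) = 6
example = fom(30, 10, 50, 50)
```

Taking the absolute value means a pattern enriched in controls scores as highly as one equally enriched in cases.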
IBM 18
Statistical Significance
a b a+b
c d c+d
Ncases Ncontrols Ntotal
FOMadj = FOM * ( 1 – q )
p = [ (a+b)! (c+d)! (Ncases)! (Ncontrols)! ] / [ a! b! c! d! (Ntotal)! ]
q = ∑ p, for all tables with same margins and better FOM
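The adjustment can be sketched by enumerating every table with the observed margins. The factorial expression for p equals C(a+b, a) * C(c+d, c) / C(Ntotal, Ncases), which is what `math.comb` computes below. Two assumptions are made that the slides do not state: a table with a zero cell gets an infinite FOM, and "better" means strictly greater. All function names are hypothetical.

```python
import math


def table_prob(a, b, c, d):
    """Probability p of one 2x2 table given its margins."""
    n_total = a + b + c + d
    return math.comb(a + b, a) * math.comb(c + d, c) / math.comb(n_total, a + c)


def fom(a, b, c, d):
    """abs(log(OR)); written as a difference of logs so mirrored tables tie exactly."""
    if a * d == 0 or b * c == 0:
        return math.inf  # assumption: zero cells count as the most extreme score
    return abs(math.log(a * d) - math.log(b * c))


def adjusted_fom(a, b, c, d):
    """FOM * (1 - q), where q sums p over same-margin tables with a better FOM."""
    matches, n_cases, n_controls = a + b, a + c, b + d
    observed = fom(a, b, c, d)
    q = 0.0
    for x in range(max(0, matches - n_controls), min(matches, n_cases) + 1):
        y = matches - x          # matching controls
        z = n_cases - x          # non-matching cases
        w = n_controls - y       # non-matching controls
        if fom(x, y, z, w) > observed:
            q += table_prob(x, y, z, w)
    return observed * (1.0 - q)
```

For the table a=3, b=1, c=2, d=4, only the two tables with a zero cell score better, each with p = 6/252, so q = 1/21 and the adjusted score is log(6) * 20/21.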
IBM 19
Structure Of The Output
• Algorithm yields a collection of related patterns (bicliques)
• Question of when to “lump” and when to “split” related patterns
• Lattice structure helps us decide
IBM 20
Lattice (simplified)
• A lattice can be represented as a graph with special properties (Hasse diagram)
• In the context of bicliques, each node B is characterized by 2 sets: S and D
• A directed edge exists from node B1 to B2 if and only if
  – S1 is a subset of S2 and
  – D2 is a subset of D1
IBM 21
Lattice Example
[Figure: Hasse diagram; each node is labeled S; D. Nodes by level:]
Null; 1,2,3,4
A; 1,2,3    B; 2,3,4    C; 1,3,4
A,B; 2,3    A,C; 1,3    B,C; 3,4    C,D; 1,4
A,B,C; 3    A,C,D; 1    B,C,D; 4
A,B,C,D; Null
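The edge rule and the twelve example nodes can be combined into a short sketch that recovers the edges a Hasse diagram would draw. Two choices here are assumptions: the subset test on S is taken as proper (so a node has no edge to itself), and only covering edges are kept, meaning an edge is dropped when some third node lies between its endpoints. The names `NODES`, `precedes`, and `hasse_edges` are hypothetical.

```python
# (S, D) pairs copied from the lattice example; Null is the empty set.
NODES = [
    (frozenset(), frozenset({1, 2, 3, 4})),
    (frozenset("A"), frozenset({1, 2, 3})),
    (frozenset("B"), frozenset({2, 3, 4})),
    (frozenset("C"), frozenset({1, 3, 4})),
    (frozenset("AB"), frozenset({2, 3})),
    (frozenset("AC"), frozenset({1, 3})),
    (frozenset("BC"), frozenset({3, 4})),
    (frozenset("CD"), frozenset({1, 4})),
    (frozenset("ABC"), frozenset({3})),
    (frozenset("ACD"), frozenset({1})),
    (frozenset("BCD"), frozenset({4})),
    (frozenset("ABCD"), frozenset()),
]


def precedes(b1, b2):
    """Edge rule: S1 is a (proper) subset of S2 and D2 is a subset of D1."""
    (s1, d1), (s2, d2) = b1, b2
    return s1 < s2 and d2 <= d1


def hasse_edges(nodes):
    """Keep only edges with no intermediate node between the endpoints."""
    edges = set()
    for b1 in nodes:
        for b2 in nodes:
            if not precedes(b1, b2):
                continue
            if any(precedes(b1, k) and precedes(k, b2) for k in nodes):
                continue
            edges.add((b1[0], b2[0]))
    return edges


edges = hasse_edges(NODES)
```

For instance, the root connects directly to A, B, and C, but not to C,D, because C sits between them.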
IBM 22
Lattice
[Figure: lattice diagram, annotated “Score jump”]
IBM 23
Real SNP Data (a few rows)
11581 AG AA CG AA AA CC AA GG AA AG 67
11643 GG AG CG CA AA CT AA GG AA GG 67
11644 GG AG CG AA AA CT AA GG AA GG 67
11647 GG AG CC AA AA CC AA GG AA GG 66
11657 GG AA CG AA AA CC AA GG AA AG 67
11660 GG AG CG CC AA TT AA Un AA AG 67
11664 GG AG CC Un AA CC AA GA AA AG 67
11665 GG AA CG CA AG CT AA GG AA AG 66
11666 GG AG CG AA AA CT AA GG AA GG 66
11668 GG AA CG AA AA TT AA GG AA GG 66
11669 GG AG CG CA AA CT AA GA AA GG 66
IBM 24
SNP Lattice
IBM 25
“Reading” The Lattice
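[Figure: annotated lattice. Labels recovered from the diagram:]
• Small Ns, large Nd, low FOM
• Edge labels: +S -D; +S’ -D’; +S” -D”
• Node labels: High FOM; Higher FOM; Lower FOM
• Background context
• Effect of adding S in the context of
• Children of may have better or worse scores, but are “similar” to
• Score jump
• Scores similar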