Take a Walk and Cluster Genes: A TSP-based Approach to Optimal Rearrangement Clustering
Sharlee Climer and Weixiong Zhang
This research was supported in part by NDSEG and Olin Fellowships and by NSF grants IIS-0196057 and ITR/EIA-0113618.

Post on 21-Dec-2015

TRANSCRIPT

  • Slide 1
  • Take a Walk and Cluster Genes: A TSP-based Approach to Optimal Rearrangement Clustering
      • Sharlee Climer and Weixiong Zhang
      • This research was supported in part by NDSEG and Olin Fellowships and by NSF grants IIS-0196057 and ITR/EIA-0113618.
  • Slide 2
  • Overview (Sharlee Climer, Washington University in St. Louis)
      • Introduction
      • Example
      • Results
      • Conclusion
  • Slide 3
  • Introduction
      • Rearrangement clustering
      • Rearrange rows of a matrix
      • Minimize the sum of the differences between adjacent rows: min Σ_i d(i, i+1)
      • Rows correspond to objects
      • Columns correspond to features
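
The objective on this slide can be sketched in a few lines. The function name `adjacent_row_cost` and the choice of squared Euclidean distance for d are assumptions made for illustration; the slides leave the difference measure open.

```python
import numpy as np

def adjacent_row_cost(matrix, order):
    """Sum of d(i, i+1) over consecutive rows under the given row order."""
    rows = np.asarray(matrix, dtype=float)[list(order)]
    # Assumed choice of d: squared Euclidean distance between adjacent rows.
    diffs = rows[1:] - rows[:-1]
    return float((diffs ** 2).sum())

# Toy matrix: rows are objects, columns are features.
data = [[1, 0, 1],
        [0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]

print(adjacent_row_cost(data, [0, 1, 2, 3]))  # original order -> 9.0
print(adjacent_row_cost(data, [0, 2, 1, 3]))  # identical rows adjacent -> 3.0
```

Rearrangement clustering searches over row orders for the one with the smallest such cost.
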
  • Slide 4
  • Introduction
      • Applications
          • Information retrieval
          • Manufacturing
          • Software engineering
  • Slide 5
  • Example
  • Slide 6
  • Example
      • Bond Energy Algorithm (BEA)
      • Introduced in 1972 (McCormick, Schweitzer, White)
      • Approximate solution
      • Still widely used
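
To make the "approximate solution" concrete, here is a simplified greedy insertion in the spirit of BEA, not the original 1972 procedure. It assumes the bond between two rows is their inner product and that each step inserts the row and position giving the largest gain in total bond; `bond` and `greedy_bond_order` are hypothetical names.

```python
import numpy as np

def bond(a, b):
    """Assumed BEA-style similarity: inner product of two rows."""
    return float(np.dot(a, b))

def greedy_bond_order(matrix):
    """Greedily build a row order that tries to maximize the adjacent-row bonds."""
    rows = np.asarray(matrix, dtype=float)
    order = [0]                      # start from an arbitrary row
    remaining = list(range(1, len(rows)))
    while remaining:
        best = None                  # (gain, row, insert position)
        for r in remaining:
            for pos in range(len(order) + 1):
                left = rows[order[pos - 1]] if pos > 0 else None
                right = rows[order[pos]] if pos < len(order) else None
                gain = 0.0
                if left is not None:
                    gain += bond(left, rows[r])
                if right is not None:
                    gain += bond(rows[r], right)
                if left is not None and right is not None:
                    gain -= bond(left, right)   # this old bond is broken
                if best is None or gain > best[0]:
                    best = (gain, r, pos)
        _, r, pos = best
        order.insert(pos, r)
        remaining.remove(r)
    return order

data = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(greedy_bond_order(data))
```

Because each choice is locally greedy, the resulting order is not guaranteed to be optimal, which is the gap the TSP formulation closes.
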
  • Slide 7
  • Example
  • Slide 8
  • Example
      • Optimal solution
      • Lenstra (1974) observed equivalence to the Traveling Salesman Problem (TSP)
          • Given n cities and the distance between each pair
          • Find shortest cycle visiting every city
          • NP-hard problem
  • Slide 9
  • Example
      • Transform into a TSP
          • Each object corresponds to a city
          • Distance between two cities equal to difference between the corresponding objects
          • Dummy city added to problem
          • Costs from dummy city to all other cities equal a constant
          • Location of dummy city indicates position to cut cycle into a path
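
The transformation above can be sketched on a toy instance. Brute-force enumeration stands in for a real TSP solver here and is only feasible for a handful of rows; the function name `optimal_row_order`, the squared Euclidean distance, and the dummy cost of 0 are all assumptions for illustration.

```python
from itertools import permutations
import numpy as np

def optimal_row_order(dist, dummy_cost=0.0):
    """Solve the TSP with one dummy city by enumeration; return (path, path cost)."""
    n = len(dist)
    # Augment the n x n distance matrix with a dummy city at index n whose
    # distance to every real city is the same constant.
    aug = np.full((n + 1, n + 1), dummy_cost, dtype=float)
    aug[:n, :n] = dist
    best_cost, best_path = float("inf"), None
    # Fixing the dummy as the start of the cycle means each permutation of the
    # real cities is exactly the path obtained by cutting the cycle at the dummy.
    for perm in permutations(range(n)):
        cycle = (n,) + perm + (n,)
        cost = sum(aug[a, b] for a, b in zip(cycle, cycle[1:]))
        if cost < best_cost:
            best_cost, best_path = cost, list(perm)
    return best_path, best_cost - 2 * dummy_cost

rows = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
# Pairwise squared Euclidean distances between rows (assumed choice of d).
dist = ((rows[:, None, :] - rows[None, :, :]) ** 2).sum(axis=2)
path, cost = optimal_row_order(dist)
print(path, cost)
```

Since the dummy city adds the same constant to every cycle, the shortest cycle corresponds to the shortest path through the real cities.
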
  • Slide 10
  • Example
      • TSP solvers were extremely slow even for small problems in the 1970s
      • Massive research efforts to solve TSP over the last three decades
      • Current solvers
          • Concorde (Applegate, Bixby, Chvátal, Cook, 2001)
          • Solved a 15,112 city TSP
  • Slide 11
  • Example
  • Slide 12
  • Example
      • BEA and TSP offer approximate and optimal solutions, respectively
      • We have observed a flaw in the objective function when the objects form natural clusters
          • The objective minimizes the sum of the distances between every pair of adjacent rows
          • Inter-cluster distances tend to be significantly larger than intra-cluster distances
          • Summation dominated by inter-cluster distances
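
A small arithmetic illustration of this flaw, using made-up one-dimensional points that form two tight clusters (the values are illustrative, not data from the talk):

```python
# Two natural clusters: {0, 1, 2} and {100, 101, 102}, already in sorted order.
points = [0.0, 1.0, 2.0, 100.0, 101.0, 102.0]

# Distances between adjacent rows in this ordering.
gaps = [abs(b - a) for a, b in zip(points, points[1:])]

# The single inter-cluster gap versus the whole objective.
inter = max(gaps)
share = inter / sum(gaps)
print(gaps, round(share, 2))
```

Here one inter-cluster gap of 98 accounts for about 96% of the total of 102, so the objective is largely insensitive to how well rows are ordered inside each cluster.
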
  • Slide 13
  • Example
      • TSPCluster addresses this flaw
      • Add k dummy cities
          • k clusters are specified by the output
          • TSP solver ignores inter-cluster distances
          • Minimizes sum of intra-cluster distances
      • Use sufficiently small constant for distances to/from dummy cities
      • Dummy cities never adjacent to each other
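
A minimal sketch of the k-dummy-city idea on a toy instance. It is not the authors' implementation: brute-force enumeration replaces a real TSP solver, and the name `tsp_cluster_bruteforce`, the dummy cost of 0, and the large dummy-to-dummy cost `big` (used here to keep dummies from being adjacent) are assumptions.

```python
from itertools import permutations
import numpy as np

def tsp_cluster_bruteforce(dist, k, dummy_cost=0.0, big=1e9):
    """Split n cities into k ordered clusters by solving a TSP with k dummies."""
    n = len(dist)
    m = n + k
    aug = np.full((m, m), dummy_cost, dtype=float)
    aug[:n, :n] = dist
    aug[n:, n:] = big            # assumption: forbid dummy-dummy adjacency
    start = m - 1                # fix one dummy as the start of the cycle
    best_cost, best_cycle = float("inf"), None
    for perm in permutations(range(m - 1)):
        cycle = (start,) + perm + (start,)
        cost = sum(aug[a, b] for a, b in zip(cycle, cycle[1:]))
        if cost < best_cost:
            best_cost, best_cycle = cost, cycle
    # Cut the cycle at every dummy city; each segment is one cluster.
    clusters, current = [], []
    for city in best_cycle[1:]:
        if city >= n:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(city)
    return clusters

pts = np.array([0.0, 1.0, 2.0, 100.0, 101.0, 102.0])
dist = np.abs(pts[:, None] - pts[None, :])
clusters = tsp_cluster_bruteforce(dist, k=2)
print(clusters)
```

Every edge that touches a dummy costs the same constant, so the solver pays nothing extra for the jump between clusters and only the intra-cluster distances influence the result.
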
  • Slide 14
  • Example
  • Slide 15
  • Results
      • Arabidopsis
          • 499 genes
          • 25 conditions
      • Comparison with BEA
          • Used BEA similarity measure
          • BEA score: 447,070
          • TSPCluster score: 452,109 (k = 1)
  • Slide 16
  • Results (figure panels: BEA, TSPCluster)
  • Slide 17
  • Results
      • Compared with Cluster (Eisen et al., 1998) and k-ary (Bar-Joseph et al., 2003)
      • Used Pearson correlation coefficient
          • Cluster: 398
          • k-ary: 427
          • TSPCluster: 436 (k = 1)
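
The slide does not spell out how these scores are computed. A plausible reading, stated here as an assumption, is the sum of Pearson correlation coefficients between adjacent rows of the final ordering (higher is better). The helper name `pearson_order_score` and the toy data are hypothetical.

```python
import numpy as np

def pearson_order_score(matrix, order):
    """Sum of Pearson correlations between consecutive rows in the given order."""
    corr = np.corrcoef(np.asarray(matrix, dtype=float))  # row-by-row correlations
    return float(sum(corr[a, b] for a, b in zip(order, order[1:])))

# Toy expression profiles: rows 0 and 1 rise together, rows 2 and 3 fall together.
profiles = [[1, 2, 3],
            [2, 4, 6],
            [3, 2, 1],
            [6, 4, 2]]

print(pearson_order_score(profiles, [0, 1, 2, 3]))  # correlated rows adjacent
print(pearson_order_score(profiles, [0, 2, 1, 3]))  # anti-correlated rows adjacent
```

To feed a similarity like this to a TSP solver, it first has to be turned into a distance, for example 1 - r; that conversion is also an assumption rather than something the slides specify.
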
  • Slide 18
  • Results (figure panels: Cluster, k-ary, TSPCluster)
  • Slide 19
  • Results
      • TSPCluster with k equal to 2 to 50
      • How many clusters?
          • Average inter-cluster distances
          • BEA local peaks: 6, 13, 19, 26, 29, 35, 40, 47
          • Pearson correlation coefficient local peaks: 3, 9, 12, 21, 26, 40
      • Computation time varied
          • Less than half a minute to ~3 minutes
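
Picking candidate values of k from such a curve amounts to finding local peaks. The sketch below assumes a peak is a k whose value is strictly greater than both neighbours; `local_peaks` is a hypothetical helper, and the numbers are invented for illustration, not the talk's measurements.

```python
def local_peaks(values_by_k):
    """Return every k whose value is strictly greater than both neighbouring k."""
    ks = sorted(values_by_k)
    return [k for prev, k, nxt in zip(ks, ks[1:], ks[2:])
            if values_by_k[k] > values_by_k[prev]
            and values_by_k[k] > values_by_k[nxt]]

# Illustrative average inter-cluster distances for k = 2..7 (made-up values).
avg_inter = {2: 1.0, 3: 2.0, 4: 1.5, 5: 1.7, 6: 1.9, 7: 1.2}
print(local_peaks(avg_inter))
```

A peak suggests that the k cluster boundaries fall on unusually large gaps, which is one way to read the lists of local peaks reported on the slide.
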
  • Slide 20
  • Results (figure panels: k = 26, k = 40)
  • Slide 21
  • Conclusion
      • Most problems have errors in their data
      • Error introduced by approximation algorithms can't be expected to undo this error
      • Computers are cheap
      • Computers and solvers are sophisticated
      • Don't always have to resort to approximate solutions, even for NP-hard problems
  • Slide 22
  • Conclusion
      • Rearrangement clustering provides a linear ordering
      • Linear ordering inherent to many applications
          • Information retrieval
          • Manufacturing
          • Software engineering
  • Slide 23
  • Conclusion
      • Gene data arranged in linear order to examine data
      • Linear ordering not necessarily essential to gene clustering problems
      • Current work
          • Optimally solve subproblems in clustering algorithms
  • Slide 24
  • Questions?