![Page 1: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/1.jpg)
Framework for Sequence Cluster Merging (Also showing importance of domain knowledge)
Arvind Gopu
Masters student, Computer Science & Bioinformatics
Indiana University, Bloomington
http://biokdd.informatics.indiana.edu/~agopu
![Page 2: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/2.jpg)
Introduction
- Sequence clustering is a very important research topic.
  - Bottom-up approach: merge elements recursively up to a certain specificity
  - Top-down approach: split elements until the desired specificity is achieved
  - Two important issues: selectivity and sensitivity
- The sequence clustering problem is unique
  - No “observable” attributes, unlike most clustering problems
  - Example:
    - Supermarket: Soda, Fruit juice, Frozen foods, Clothing, etc.
    - Demographic: Height, Race, etc.
    - Sequence clustering: Just a bunch of amino acid characters! (with accompanying well-studied sequence comparison/alignment programs)
![Page 3: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/3.jpg)
Introduction …
- Getting back to sequence clustering…
  - Fragmentation problem: well known in sequence clustering algorithms.
  - Example: BAG (Sun Kim): 99 % accuracy (selective), but at the cost of ~40-50 % fragmentation (over-sensitive)
- Solution?
  - Bottom-up merging of the fragmented clusters
![Page 4: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/4.jpg)
Need for framework
- The suggested bottom-up approach is possible using various sub-methods
- Framework:
  - Do common and unique tasks seamlessly
  - Insert new sub-methods easily, with very little hassle
- Implemented primarily in Perl, with supporting C programs and Unix shell scripts
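One way to picture "insert new sub-methods easily" is a small plug-in registry. The sketch below is in Python purely for illustration (the talk says the actual framework is in Perl); every name in it (`MERGE_TESTS`, `register`, `build_combined_profile`, `always_reject`, `evaluate_suggestions`) is hypothetical and not taken from the framework.

```python
from typing import Callable, Dict, List, Tuple

# A merge test receives the two fragment clusters (lists of sequence IDs)
# plus the shared combined profile, and returns True for a "good" merge.
MergeTest = Callable[[List[str], List[str], dict], bool]

# Hypothetical registry of sub-methods, keyed by a short name.
MERGE_TESTS: Dict[str, MergeTest] = {}


def register(name: str) -> Callable[[MergeTest], MergeTest]:
    """Decorator that plugs a new merge'bility sub-method into the registry."""
    def wrap(fn: MergeTest) -> MergeTest:
        MERGE_TESTS[name] = fn
        return fn
    return wrap


def build_combined_profile(c1: List[str], c2: List[str]) -> dict:
    """Common task (placeholder): stands in for the expensive MSA step."""
    return {"members": sorted(set(c1) | set(c2))}


@register("always_reject")
def always_reject(c1: List[str], c2: List[str], profile: dict) -> bool:
    """Toy sub-method, present only to show the plug-in mechanism."""
    return False


def evaluate_suggestions(
    suggestions: List[Tuple[List[str], List[str]]]
) -> List[Dict[str, bool]]:
    """Run every registered test on each suggested pair, sharing one profile."""
    results = []
    for c1, c2 in suggestions:
        profile = build_combined_profile(c1, c2)  # common work, done once per pair
        results.append(
            {name: test(c1, c2, profile) for name, test in MERGE_TESTS.items()}
        )
    return results
```

Adding another sub-method then only requires one more decorated function; the common profile step is untouched.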
![Page 5: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/5.jpg)
Framework Schematic
[Schematic with boxes: Merge Suggestions from Clustering Algorithm; Prepare Sequence Data; Generate Combined Profile for Two Fragment Clusters; Test Merge’bility; Post-process New Clustering Result; Enhanced Clustering Result; Test Scaffold]
![Page 6: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/6.jpg)
Framework – Profile Generation
[Framework schematic repeated, with the “Generate Combined Profile for Two Fragment Clusters” step highlighted]
![Page 7: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/7.jpg)
Profile Generation – MSA
[Diagram: clusters C1 and C2 are aligned into MSA (C1) and MSA (C2), which are then aligned together as MSA (C1, C2), the combined profile]
MSA = Multiple Sequence Alignment
![Page 8: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/8.jpg)
Profile Generation – MSA
- Common first step: MSA profile generation for the two fragment clusters C1 and C2 (Clustalw)
  - MSA (C1) and MSA (C2)
  - Most expensive step in the framework
- Common second step: combined profile generation (Clustalw)
  - Prof_Align [MSA (C1), MSA (C2)]
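The two common steps can be sketched as three Clustalw invocations. This Python sketch only builds the argument lists and does not run anything. The executable name `clustalw`, the file names, and the helper names are assumptions, and the flags (`-INFILE`, `-ALIGN`, `-OUTFILE`, `-PROFILE1`, `-PROFILE2`, `-PROFILE`) are written as recalled from the Clustalw command-line help, so they should be checked against the installed version.

```python
from typing import List

CLUSTALW = "clustalw"  # assumed executable name; adjust to the local install


def msa_command(fasta_in: str, aln_out: str) -> List[str]:
    """Argument list for a full multiple alignment of one fragment cluster."""
    return [CLUSTALW, f"-INFILE={fasta_in}", "-ALIGN", f"-OUTFILE={aln_out}"]


def profile_align_command(aln1: str, aln2: str, aln_out: str) -> List[str]:
    """Argument list for merging two existing alignments by profile alignment."""
    return [
        CLUSTALW,
        f"-PROFILE1={aln1}",
        f"-PROFILE2={aln2}",
        "-PROFILE",
        f"-OUTFILE={aln_out}",
    ]


def profile_generation_plan(c1_fasta: str, c2_fasta: str) -> List[List[str]]:
    """MSA (C1), MSA (C2), then Prof_Align [MSA (C1), MSA (C2)]."""
    return [
        msa_command(c1_fasta, "c1.aln"),          # hypothetical output names
        msa_command(c2_fasta, "c2.aln"),
        profile_align_command("c1.aln", "c2.aln", "combined.aln"),
    ]
```

Each list could be handed to `subprocess.run` in order. The first two commands are the expensive ones, which is why the framework treats them as a shared step.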
![Page 9: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/9.jpg)
Profile Generation – MSA explained..
- All of the implemented techniques depend on MSA profiles
- MSA profile: align more than 2 sequences simultaneously
Image from http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/
![Page 10: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/10.jpg)
Profile Generation – MSA explained..
Image from http://www.mscs.mu.edu/~cstruble/class/mscs230/fall2002/notes/3
![Page 11: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/11.jpg)
Framework – Merge’bility Test
[Framework schematic repeated, with the “Test Merge’bility” step highlighted]
![Page 12: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/12.jpg)
Model Comparison based Merge Test
![Page 13: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/13.jpg)
Model Comparison based Merge Test
- Statistics/machine learning technique based method:
  - Uses relative entropy and statistical measures w.r.t. the runs test
- Drawbacks
  - Almost impossible to nail down threshold values for the z-score or any other statistical measure
  - Extremely dependent on sample size equality: does not work well when the two fragment sizes vary
![Page 14: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/14.jpg)
Model Comparison based Merge Test
- Each column in an MSA profile is a probabilistic model (details of construction beyond the scope of this talk)
- Compute similarity between corresponding columns in the two fragments: Kullback-Leibler distance
  - Need to consider gaps while matching up columns: challenging task
  - Also need to screen for random “good” distances: taken care of using a random model in the distance computation
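A minimal sketch of a column-wise Kullback-Leibler distance, under stated assumptions: each column is reduced to residue frequencies with a pseudocount, gaps are simply ignored, and the symmetrized form of the divergence is used. The talk does not give its column model, gap handling, or random-model correction, so none of those are reproduced here; all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def column_distribution(column: str, pseudocount: float = 1.0) -> Dict[str, float]:
    """Residue frequencies for one alignment column; gap characters are ignored."""
    counts = Counter(ch for ch in column.upper() if ch in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """D(p || q) in bits; the pseudocount keeps every probability positive."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in ALPHABET)


def symmetric_kl(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Average of D(p || q) and D(q || p), an assumed way to get a symmetric score."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def column_distances(msa1: List[str], msa2: List[str]) -> List[float]:
    """Distance for each pair of corresponding columns of two equal-width alignments."""
    width = len(msa1[0])
    cols1 = ["".join(row[i] for row in msa1) for i in range(width)]
    cols2 = ["".join(row[i] for row in msa2) for i in range(width)]
    return [
        symmetric_kl(column_distribution(a), column_distribution(b))
        for a, b in zip(cols1, cols2)
    ]
```

Identical columns give a distance of zero, and the score grows as the residue compositions diverge. The resulting list plays the role of the raw column-wise scores that are later turned into a distance vector.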
![Page 15: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/15.jpg)
Model Comparison based Merge Test
![Page 16: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/16.jpg)
Model Comparison based Merge Test
- Using column-wise comparison distance scores, compute a “distance vector”
  - Symbolic representation for “good”, “bad” and “don’t care” distances (detail abstracted)
- Do a standard statistical test: runs test to check how random the distance vector is…
  - Nice pattern: y | y | y | n | n | y | y | y | n | n | y | y | y
  - Random pattern: y | n | y | y | n | n | n | y | n | y | n | n | y | y
![Page 17: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/17.jpg)
Model Comparison based Merge Test
4) Do Runs test
![Page 18: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/18.jpg)
Model Comparison based Merge Test
- Compute mean, standard deviation and subsequently z-score
  - Threshold to separate “good” and “bad” merges
- Drawbacks again…
  - Threshold will be sample specific; hard to have one threshold for the entire dataset (illustrated in test results)
  - Failure rate is high if sample sizes are unequal
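The mean, standard deviation and z-score can be sketched with the standard Wald-Wolfowitz runs test for a two-symbol sequence. This assumes the distance vector has already been reduced to 'y'/'n' symbols; dropping every other symbol as “don’t care” is an assumption, since the talk abstracts that detail. Function names are hypothetical.

```python
import math


def parse_pattern(text: str) -> str:
    """Keep only 'y' and 'n'; separators and any other symbols are dropped."""
    return "".join(ch for ch in text.lower() if ch in "yn")


def count_runs(symbols: str) -> int:
    """Number of maximal blocks of identical consecutive symbols."""
    if not symbols:
        return 0
    return 1 + sum(1 for a, b in zip(symbols, symbols[1:]) if a != b)


def runs_test_z(symbols: str) -> float:
    """z-score of the observed number of runs under the randomness hypothesis."""
    n1 = symbols.count("y")
    n2 = symbols.count("n")
    n = n1 + n2
    if n1 == 0 or n2 == 0:
        raise ValueError("both symbols must occur at least once")
    mean = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    if variance == 0.0:
        raise ValueError("sequence too short for a runs test")
    return (count_runs(symbols) - mean) / math.sqrt(variance)


nice = parse_pattern("y | y | y | n | n | y | y | y | n | n | y | y | y")
rand = parse_pattern("y | n | y | y | n | n | n | y | n | y | n | n | y | y")
print(count_runs(nice), round(runs_test_z(nice), 2))
print(count_runs(rand), round(runs_test_z(rand), 2))
```

For the two example patterns from the earlier slide, the “nice” one has fewer runs than expected by chance (negative z) and the “random” one slightly more (positive z). Where to put a cut-off between them is exactly the threshold problem listed under the drawbacks.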
![Page 19: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/19.jpg)
Phylogenetic Tree based Merge Test
![Page 20: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/20.jpg)
Merge’bility Test – Techniques …
- Phylogenetic tree based method:
  - Evolutionary distance based method
    - Drawback: too strict; many false negatives possible; also hard to nail down a threshold
  - Evolutionary Least Common Ancestor (LCA) based method
    - Improved performance on both of the previously mentioned issues
![Page 21: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/21.jpg)
Phylogenetic Tree
Evolutionary Distance based Merge Test
![Page 22: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/22.jpg)
Phylogenetic Tree Distance based method
- Clustalw (or other tree generation tools) provides an NJ (neighbor-joining) tree of an MSA profile
- Sequence-length-normalized distance from the root for each sequence
  - 0 < distance < 1
- Define some threshold for distance that constitutes intra/inter cluster distances
![Page 23: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/23.jpg)
Phylogenetic Tree Distance based method
- Distance between sequences from…
  - Two clusters will be closer to:
    - ‘1’ if the two clusters are not merge’ble: call these “bad distances”
    - ‘0’ if the two clusters are actually part of the same super cluster
  - The same cluster will obviously be closer to ‘0’: these constitute “good distances”; don’t care in our case
- Count the number of “bad distances”
  - Gives a good idea of how good a merge is
![Page 24: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/24.jpg)
Phylogenetic Tree Distance based method
- Good enough? Not yet: need for normalization of the “bad distance” count.
- Why? The number of edges between vertices of the same/different clusters is proportional to the size of the clusters!
![Page 25: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/25.jpg)
Phylogenetic Tree Distance based method
- Once normalization of the number of “bad distances” is done, this method churned out decent results
  - Normalizing factor? Contentious.. What is a good normalizer?
- Method too strict for unequally sized clusters.
  - Most merges rejected, leading to an appreciable number of false negatives
  - Inherent nature of MSA programs and unequally sized profiles (cluster sizes)
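A minimal sketch of counting and normalizing “bad distances”, under loud assumptions: pairwise distances in the range 0 to 1 are supplied directly as a dictionary (how the talk derives them from the tree is not shown), the threshold of 0.5 is arbitrary, and dividing by the number of cross-cluster pairs is only one possible normalizer, which the talk itself calls contentious. The function name is hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, List


def bad_distance_fraction(
    c1: List[str],
    c2: List[str],
    dist: Dict[FrozenSet[str], float],
    threshold: float = 0.5,  # assumed cut-off, not a value from the talk
) -> float:
    """Fraction of cross-cluster pairs whose distance exceeds the threshold."""
    bad = sum(
        1 for a, b in product(c1, c2) if dist[frozenset((a, b))] > threshold
    )
    # Assumed normalizer: the number of cross-cluster pairs, |C1| * |C2|.
    return bad / (len(c1) * len(c2))


# Toy input: two sequences in one fragment, one in the other.
toy_dist = {
    frozenset(("a", "c")): 0.9,
    frozenset(("b", "c")): 0.2,
}
print(bad_distance_fraction(["a", "b"], ["c"], toy_dist))  # 0.5
```

Normalizing by the pair count removes the dependence on cluster size that the raw count has, but it still leaves the choice of threshold open.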
![Page 26: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/26.jpg)
Phylogenetic Tree
LCA Coverage based Merge Test
![Page 27: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/27.jpg)
Phy.Tree LCA coverage based method
- Clustalw, Phylip (or other tree generation tools) provide a rooted phylogenetic tree for an MSA profile
- Looking at the tree, one can easily make out whether a pair of clusters should be merged or not
  - How? Parse the tree into a usual tree data structure and look for the common ancestor of the sequences of each cluster
- Example…
![Page 28: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/28.jpg)
Phy.Tree LCA coverage based method
Good Merge
Sequences of the two clusters (shaded blue and red) are from the same super cluster
![Page 29: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/29.jpg)
Phy.Tree LCA coverage based method
Bad Merge
Sequences of the two clusters (shaded blue and red) are from different super clusters
![Page 30: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/30.jpg)
Phy.Tree LCA coverage based method
- Same LCA for both clusters? Good merge!
- If not … Bad merge?
  - Not quite. It is possible that the LCAs are different but cover sequences from either cluster to a considerable extent
  - Better to use coverage of LCAs instead
- Example…
![Page 31: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/31.jpg)
Phy.Tree LCA coverage based method
Why LCA Coverage?
Second cluster has three sequences, but its LCA covers four more sequences from the other cluster
![Page 32: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/32.jpg)
Phy.Tree LCA coverage based method
- Coverage test:
  - For clusters Ci and Ck, choose the smaller cluster, say Ci, i.e. | Ci | < | Ck |
  - Define Cov (LCA[Ci]) as the number of sequences that the LCA of Ci covers.
  - If Cov (LCA[Ci]) > # of sequences in Ci … where | Ci | < | Ck |
    - i.e. { Cov (LCA[Ci]) / | Ci | } > 1
    - Or {Cross Coverage (LCA[Ci])} > 0
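The coverage test can be sketched on a tree held as nested tuples of leaf names. Two assumptions are made: the tree has already been parsed into that form (the Clustalw or Phylip output format is not handled), and a positive cross coverage is read as a “good” merge, which follows the surrounding slides rather than an explicit statement on this one. All names are hypothetical.

```python
from typing import Set, Tuple, Union

# A tree is either a leaf name or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> Set[str]:
    """All leaf names below (and including) this node."""
    if isinstance(tree, str):
        return {tree}
    found: Set[str] = set()
    for child in tree:
        found |= leaves(child)
    return found


def lca_subtree(tree: Tree, cluster: Set[str]) -> Tree:
    """Smallest subtree whose leaves include every member of the cluster."""
    if not cluster <= leaves(tree):
        raise ValueError("cluster contains sequences that are not in the tree")
    if not isinstance(tree, str):
        for child in tree:
            if cluster <= leaves(child):
                return lca_subtree(child, cluster)
    return tree


def cross_coverage(tree: Tree, cluster: Set[str]) -> int:
    """Cov(LCA[Ci]) minus |Ci|: sequences under the LCA that are not in Ci."""
    return len(leaves(lca_subtree(tree, cluster))) - len(cluster)


def coverage_test(tree: Tree, c1: Set[str], c2: Set[str]) -> bool:
    """True when the smaller cluster's LCA also covers other sequences."""
    smaller = min(c1, c2, key=len)  # on a tie the first cluster is used
    return cross_coverage(tree, smaller) > 0


A = {"a1", "a2", "a3"}
B = {"b1", "b2", "b3"}
mixed = ((("a1", "b1"), ("a2", "b2")), ("a3", "b3"))   # clusters interleaved
separate = (("a1", "a2", "a3"), ("b1", "b2", "b3"))    # clusters in own subtrees
print(coverage_test(mixed, A, B), coverage_test(separate, A, B))
```

In the interleaved tree the LCA of either cluster is the root, so it covers sequences from the other cluster. In the separated tree each LCA covers only its own cluster and the cross coverage is zero.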
![Page 33: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/33.jpg)
Phy.Tree LCA coverage based method
- Advantages:
  - Sample size difference does not play a big role
  - Demarcating between “good” and “bad” merges is much simpler and more straightforward
  - Shown to work really well on a variety of data sizes and difficulty levels: test results…
- Possible weakness:
  - Bound to fail for extremely small fragments (say 2 sequences each): hard not to have a common LCA!
![Page 34: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/34.jpg)
Test Results – 4 datasets (from COG database)
![Page 35: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/35.jpg)
Test Results – Data set 1
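DATA: COG {0001, 0005} (Real Size: 35, 30)
Observed outcome by merge’bility test method:

| Fragment cluster size n (F1) | Fragment cluster size n (F2) | Expected outcome (Good / Bad) | Observed: Model Comparison (0.0001) | Observed: Phy.tree Distance | Observed: Phy.tree LCA coverage |
|---|---|---|---|---|---|
| 10 | 10 | Good | Good | Good | Good |
| 10 | 10 | Bad | Bad | Bad | Bad |
| 10 | 5 | Good | Good | Good | Good |
| 10 | 5 | Bad | Bad | Bad | Bad |
| 10 | 3 | Good | Good | Good | Good |
| 10 | 3 | Bad | Bad | Bad | Bad |
| 4 | 2 | Good | Good | Good | Good |
| 4 | 2 | Bad | Bad | Bad | Bad |
| 3 | 3 | Good | Good | Good | Good |
| 3 | 3 | Bad | Bad | Bad | Bad |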
![Page 36: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/36.jpg)
Test Results – Data set 2
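DATA: COG {0142, 0183} (Real Size: 74, 116)
Observed outcome by merge’bility test method:

| Fragment cluster size n (F1) | Fragment cluster size n (F2) | Expected outcome (Good / Bad) | Observed: Model Comparison (0.001) | Observed: Phy.tree Distance | Observed: Phy.tree LCA coverage |
|---|---|---|---|---|---|
| 10 | 10 | Good | Good | Good | Good |
| 10 | 10 | Bad | Bad | Bad | Bad |
| 10 | 5 | Good | Good | Bad | Good |
| 10 | 5 | Bad | Bad | Bad | Bad |
| 10 | 3 | Good | Good | Bad | Good |
| 10 | 3 | Bad | Good | Bad | Bad |
| 4 | 2 | Good | Good | Bad | Good |
| 4 | 2 | Bad | Bad | Bad | Bad |
| 3 | 3 | Good | Good | Bad | Bad |
| 3 | 3 | Bad | Bad | Bad | Bad |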
![Page 37: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/37.jpg)
Test Results – Data set 3
DATA: COG {0380, 0383} (Real Size: 15, 13)
Observed outcome by merge’bility test method:

| Fragment cluster size n (F1) | Fragment cluster size n (F2) | Expected outcome (Good / Bad) | Observed: Model Comparison (0.001 / 0.0005) | Observed: Phy.tree Distance | Observed: Phy.tree LCA coverage |
|---|---|---|---|---|---|
| 10 | 10 | Good | Good / Bad | Good | Good |
| 10 | 10 | Bad | Good / Bad | Bad | Bad |
| 10 | 5 | Good | Good / Bad | Bad | Good |
| 10 | 5 | Bad | Good / Bad | Bad | Bad |
| 10 | 3 | Good | Good / Bad | Bad | Good |
| 10 | 3 | Bad | Good / Bad | Bad | Bad |
| 4 | 2 | Good | Good / Good | Good | Bad |
| 4 | 2 | Bad | Bad / Good | Bad | Bad |
| 3 | 3 | Good | Good / Bad | Good | Good |
| 3 | 3 | Bad | Bad / Bad | Bad | Good |
![Page 38: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/38.jpg)
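Test Results – Data set 4
DATA: COG {0160, 0161} (Real Size: 79, 49)
Observed outcome by merge’bility test method:

| Fragment cluster size n (F1) | Fragment cluster size n (F2) | Expected outcome (Good / Bad) | Observed: Model Comparison (0.0001) | Observed: Phy.tree Distance | Observed: Phy.tree LCA coverage |
|---|---|---|---|---|---|
| 10 | 10 | Good | Bad | Good | Good |
| 10 | 10 | Bad | Good | Good | Bad |
| 10 | 5 | Good | Bad | Good | Good |
| 10 | 5 | Bad | Good | Good | Bad |
| 10 | 3 | Good | Good | Good | Good |
| 10 | 3 | Bad | Good | Bad | Bad |
| 4 | 2 | Good | Good | Good | Good |
| 4 | 2 | Bad | Good | Good | Bad |
| 3 | 3 | Good | Good | Good | Good |
| 3 | 3 | Bad | Good | Good | Bad |
| 2 | 2 | Good | Good | Good | Good |
| 2 | 2 | Bad | Good | Good | Good |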
![Page 39: Framework for Sequence Cluster Merging (Also showing importance of domain knowledge) Arvind Gopu Masters student, Computer Science & Bioinformatics Indiana](https://reader030.vdocument.in/reader030/viewer/2022033108/56649ec05503460f94bcc124/html5/thumbnails/39.jpg)
Acknowledgements!
- A big thank you to:
  - Prof. Sun Kim, advisor
  - My parents, brother, grandparents!
  - All my colleagues and friends: JH, Zhiping, Scott Martin, SR, Raj, Anshul, Pat Hayes and everyone else!
  - Folks at CS & Informatics: CS Systems staff, Lucy, Linda, Wendy, Cheryl, Errissa, Bob!
  - Profs. Marty Siegel and Gary Wiggins, GPC.
  - RATS folks!
- Did I forget someone?! Sorry if I did…