Clustering is an advanced analytical function with immense potential to transform the knowledge space, particularly when applied to a data-rich domain such as computational biology. The computational hardness of the underlying theoretical problem necessitates the use of heuristics in practice. In this study, we evaluate the application of a randomized sampling-based heuristic called shingling to unweighted biological graphs, and we propose a new variant of this heuristic that extends its application to weighted graph inputs and is better positioned to achieve qualitative gains on unweighted inputs as well. We also present parallel algorithms for this heuristic using the MapReduce paradigm. Experimental results on subsets of a medium-scale real-world input (containing up to 10.3M vertices and 640M edges) demonstrate significant qualitative improvements in the reported clustering, both with and without edge weights. Furthermore, performance studies indicate near-linear scaling on up to 4K cores of a distributed-memory supercomputer.
ABSTRACT
OBJECTIVES
METHODS
EXPERIMENTAL RESULTS
CONTRIBUTIONS
ACKNOWLEDGMENT and CONTACTS
US Department of Energy DE-SC-0006516
National Science Foundation IIS 0916463
SC’13 Student Volunteer Scholarship
ACM Microsoft Research Travel Award
This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Inna Rytsareva: [email protected]
Ananth Kalyanaraman: [email protected]
Scalable Parallel Algorithms for Clustering Large-Scale Biological Graphs
School of Electrical Engineering and Computer Science, Washington State University Pullman, WA, USA
Inna Rytsareva, Ananth Kalyanaraman (Advisor)
http://www.mkbergman.com/
New variant of the Shingling heuristic based on a normalization technique
Reduction of # singletons
Ability to handle unweighted and weighted graphs
MapReduce algorithm for parallelizing the new variant
Extensive experimental results:
◦ qualitative improvement over the standard heuristic
◦ performance improvement of the MPI implementation over the Hadoop implementation
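[Figure: MapReduce workflow for one shingling phase. Input: adjacency list of G(Vl, Vr, E), one record u: {v1, v2, …, vk} per vertex. Mappers: generate c shingles for the adjacency list of u. Sort by shingle ID. Reducers: for each shingle, build the transformed graph GI(Vs, Vl, E’). Loop back to the next shingling phase.]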
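The map, sort, and reduce stages in the workflow above can be sketched in plain Python. This is a minimal single-process illustration, not the poster's MPI-MapReduce or Hadoop code: the function names (`rank`, `mapper`, `shuffle`, `reducer`, `shingling_phase`), the use of a hash as a stand-in for a random permutation, the choice of (trial, members) as the shingle ID, and the rule that vertices with fewer than s neighbors emit nothing are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def rank(vertex, trial):
    """Pseudo-random rank of a vertex in the permutation used for one trial (hash-based stand-in)."""
    return hashlib.blake2b(f"{trial}:{vertex}".encode(), digest_size=8).digest()


def mapper(u, neighbors, s, c):
    """Emit (shingle_id, u) pairs: up to c shingles for the adjacency list of u."""
    if len(neighbors) < s:
        return  # assumption: too few neighbors to form a shingle
    for trial in range(c):
        first_s = sorted(neighbors, key=lambda v: rank(v, trial))[:s]
        yield (trial, tuple(sorted(first_s))), u


def shuffle(pairs):
    """Group values by key and sort by shingle ID, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(shingle_id, vertices):
    """Emit one adjacency record of the transformed graph: shingle -> vertices that produced it."""
    return shingle_id, sorted(vertices)


def shingling_phase(adj, s, c):
    """Run one shingling phase over adj (dict: vertex -> set of neighbors)."""
    pairs = [kv for u, nbrs in adj.items() for kv in mapper(u, nbrs, s, c)]
    return dict(reducer(key, values) for key, values in shuffle(pairs))


adj = {"u": {1, 2, 3}, "v": {1, 2, 3}, "w": {7, 8, 9}}
transformed = shingling_phase(adj, s=2, c=2)
```

In this toy input, u and v have identical adjacency lists, so every shingle they produce groups them together, while w ends up alone. The returned dictionary has the same adjacency-list shape as the input, which is what lets a second phase consume it.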
Characterizing protein families in environmental microbial communities
[Figure: metagenomic annotation pipeline. Labels: Multiple community data (~350 metagenomics projects as of today); Assemble DNA & predict genes; Translate into ORFs; x10^6-8 new sequences; Already known protein seq. & families (~50x10^6 clusters, x10^8 sequences); Protein family assignment & discovery; Community function inference; Community annotation.]
[Figure: clustering workflow. Labels: n sequences; Remove redundant sequences; All vs. all sequence comparison; Homology graph; Dense subgraphs to clusters.]
Sequence homology graphs:
Wu & Kalyanaraman, 2008
Theoretical formulations are NP-Hard (Feige et al., 2001)
◦ Need for efficient heuristics
Metrics for quality evaluation
To allow for overlaps or not
Lack of parallel tools
◦ Developments recent/ongoing (e.g., 10th DIMACS challenge)
Qualitative Assessment
Input: UBC-25M. Quality metrics: Modularity, Density. Clustering statistics: # non-singleton clusters, % of singletons, largest cluster size.

Data set, Algorithm       Modularity   Density          # non-singleton clusters   % of singletons   Largest cluster size
Unweighted, Standard      0.8454       0.3126±0.2506    27,479                     35.15%            21,973
Unweighted, Normalized    0.9928       0.5603±0.3866    95,505                     2.39%             25,827
Weighted, Normalized      0.9937       0.5603±0.3866    95,505                     2.12%             25,864
Weighted, Louvain         0.9695       0.5309±0.3864    116,558                    0%                13,297
Weighted, MCL             0.9383       0.5259±0.3837    118,518                    0%                5,507
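For readers unfamiliar with the Modularity column, the sketch below computes the standard Newman modularity of a clustering on an unweighted, undirected graph, Q = sum over clusters of (m_c / m) - (d_c / 2m)^2, where m is the number of edges, m_c the number of edges inside cluster c, and d_c the total degree of its vertices. The poster does not say which modularity variant was used or how Density was computed, so this definition and the function name `modularity` are assumptions for illustration only.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Newman modularity for an undirected, unweighted edge list (assumed definition).

    edges: list of (u, v) pairs, each undirected edge listed once.
    cluster_of: dict mapping each vertex to its cluster label.
    """
    m = len(edges)
    intra_edges = defaultdict(int)   # m_c: edges with both endpoints in cluster c
    total_degree = defaultdict(int)  # d_c: sum of degrees of vertices in cluster c
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        total_degree[cu] += 1
        total_degree[cv] += 1
        if cu == cv:
            intra_edges[cu] += 1
    return sum(intra_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
               for c in total_degree)


# Two triangles joined by a single edge, clustered as one triangle each.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
clusters = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
q = modularity(edges, clusters)
```

For this toy graph each cluster holds 3 of the 7 edges and has total degree 7, so Q = 2 * (3/7 - 1/4) = 5/14. Placing all six vertices in one cluster gives Q = 0.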
Input: Unweighted UBC subgraphs. Run-times are for our MPI-MapReduce implementation.
Performance Study
Input graph, followed by run-time (in seconds) using p cores
p=64 p=128 p=256 p=512 p=1024 p=2048 p=4096
UBC-25M 104.4 59.81 37.21 26.66 17.5 14.8
UBC-50M 159.87 100.43 50.89 32.66 25.96 17.84 20.69
UBC-100M 158.22 90.74 53.51 36.48 27.88 24.86
UBC-200M 110.57 60.23 43.66 30.53 31.14
UBC-400M 121.81 73.91 36.71 25.53
UBC-640M 102.49 70.78 36.08
The MapReduce implementation matters: the MPI-MapReduce implementation is at least two orders of magnitude faster than the Hadoop implementation.
Unweighted and weighted analyses take about the same time.
The adjacency list implementation is about 3x faster than the edge list implementation.
Performance observations
Experimental platform: Hopper (NERSC-6): Cray XE6
◦ 6,392 compute nodes, 153,216 cores
◦ 32 GB RAM per node
◦ Custom mpich-2 version 5.5.5 for Cray XE
◦ MapReduce-MPI library
Experimental Setup
Input label   Number of edges   Number of vertices
UBC-25M       25 x 10^6         3,965 x 10^3
UBC-50M       50 x 10^6         4,525 x 10^3
UBC-100M      100 x 10^6        6,795 x 10^3
UBC-200M      200 x 10^6        7,336 x 10^3
UBC-400M      400 x 10^6        8,958 x 10^3
UBC-640M      640 x 10^6        10,346 x 10^3
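[Figure: the shingling heuristic for two vertices u and v. Parameters: s, c. In each of c trials, randomly permute the adjacency lists of u and v and take s elements of each as a shingle. If the shingles are the same, then u and v are related.]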
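The pairwise test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `rank`, `shingles`, and `related` are hypothetical, a hash of (seed, trial, vertex) stands in for the shared random permutation of each trial, and a vertex with fewer than s neighbors is assumed to produce no shingles.

```python
import hashlib


def rank(vertex, trial, seed=0):
    """Position of a vertex in the pseudo-random permutation used for a given trial."""
    key = f"{seed}:{trial}:{vertex}".encode()
    return hashlib.blake2b(key, digest_size=8).digest()


def shingles(neighbors, s, c, seed=0):
    """Return the set of shingles for one adjacency list.

    In each of the c trials the neighbors are ordered by the trial's
    permutation and the first s of them form one shingle.
    """
    result = set()
    if len(neighbors) < s:
        return result  # assumption: too few neighbors to form a shingle
    for trial in range(c):
        first_s = sorted(neighbors, key=lambda v: rank(v, trial, seed))[:s]
        result.add((trial, frozenset(first_s)))
    return result


def related(adj_u, adj_v, s, c, seed=0):
    """u and v are related if they produce the same shingle in at least one trial."""
    return bool(shingles(adj_u, s, c, seed) & shingles(adj_v, s, c, seed))


same = related({1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}, s=2, c=3)
disjoint = related({1, 2, 3}, {7, 8, 9}, s=2, c=3)
```

Because the same permutation is applied to both adjacency lists within a trial, vertices with heavily overlapping neighborhoods are likely to agree on a shingle, while vertices with disjoint neighborhoods never do. Under the assumption made here, a vertex with fewer than s neighbors yields no shingles at all and can never be reported as related to anything.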
Large number of singletons
Inability to handle edge weights (e.g., degree of sequence similarity)
Scaling to very large data sets
◦ Parallelization required to reduce time to solution
Cause: Low-degree vertices tend to be left out by the Shingling sampling approach
Two contrasting scenarios for low-degree vertices:
◦ (+)ve case: Periphery
◦ (-)ve case: Chimeric
Step 1) Normalize edge weight for every edge (if not already normalized)
w_{u,v_i} = \frac{w_{u,v_i}}{w_{u,v_1} + \dots + w_{u,v_i} + \dots + w_{u,v_n}} \cdot C
C: normalizing constant
Step 2) Generate w_{u,v_i} copies for edge (u, v_i)
Perform shingling on the multigraph
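The two steps can be illustrated with a short sketch. The function names `normalize_weights` and `to_multigraph` are hypothetical, edge weights are assumed to be positive, and the way a fractional normalized weight becomes a whole number of copies (rounding, with at least one copy per edge) is an assumption, since the poster does not specify it.

```python
def normalize_weights(adj, C):
    """Step 1: scale the weights on each vertex's edges so that they sum to C.

    adj: dict mapping u -> {v: weight}; weights are assumed positive.
    """
    normalized = {}
    for u, nbrs in adj.items():
        total = sum(nbrs.values())
        normalized[u] = {v: w / total * C for v, w in nbrs.items()}
    return normalized


def to_multigraph(normalized):
    """Step 2: replace edge (u, v) with a number of copies given by its normalized weight.

    Assumption: the count is round(weight), with a minimum of one copy.
    Each copy is a distinct (v, k) element so that shingling can sample it.
    """
    multi = {}
    for u, nbrs in normalized.items():
        copies = []
        for v, w in nbrs.items():
            copies.extend((v, k) for k in range(max(1, round(w))))
        multi[u] = copies
    return multi


adj = {"u": {"a": 3.0, "b": 1.0}, "x": {"u": 1.0}}
normalized = normalize_weights(adj, C=8)
multi = to_multigraph(normalized)
```

With C = 8, vertex u gets 6 copies of edge (u, a) and 2 copies of edge (u, b), and the degree-1 vertex x gets 8 copies of its only edge. An unweighted graph can be fed through the same code by giving every edge a weight of 1.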
http://lamar.colostate.edu/~jvivanco/interactions.htm http://www.wageningenur.nl/en/show/Isolation-and-characterization-of-novel-and-shortchain-fatty-acid-producing-bacteria-in-the-anaerobic-human-gut..htm
http://www.nersc.gov/assets/About-Us/hopper1.jpg
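[Figure: two-pass shingling for dense subgraph detection. (1) 1st pass: shingle the input graph G(V, V, E) to produce the 1st shingles. (2) 2nd pass: shingle GI(Vs, V, E) to produce the 2nd shingles. (3) A≅B: dense subgraphs reported from GII(Vs, Vt, E), with vertex sets A and B. Scale labels: loose, tight.]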
REFERENCES
1. Broder, A.Z. et al. (2000), ‘Min-wise independent permutations’, Journal of Computer and System Sciences, Vol. 60, pp.630–659.
2. Gibson, D., Kumar, R. and Tomkins, A. (2005), ‘Discovering large dense subgraphs in massive graphs’, Proceedings of the International Conference on Very Large Data Bases, pp.721–732.
3. Rytsareva, I. and Kalyanaraman, A. (2013), ‘Scalable heuristics for clustering biological graphs’, 3rd IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pp.1–6.
4. Wu, C. and Kalyanaraman, A. (2008), ‘An efficient parallel approach for identifying protein families in large-scale metagenomic data sets’, Proceedings ACM/IEEE conference on Supercomputing, pp.1–10.
SC’13 ACM Student Research Competition, Denver, CO, 2013