

ABSTRACT

Clustering is an advanced analytical function with immense potential to transform the knowledge space, particularly when applied to a data-rich domain such as computational biology. The computational hardness of the underlying theoretical problem necessitates the use of heuristics in practice. In this study, we evaluate the application of a randomized sampling-based heuristic called shingling on unweighted biological graphs and propose a new variant of this heuristic that extends its application to weighted graph inputs and is better positioned to achieve qualitative gains on unweighted inputs as well. We also present parallel algorithms for this heuristic using the MapReduce paradigm. Experimental results on subsets of a medium-scale real-world input (containing up to 10.3M vertices and 640M edges) demonstrate significant qualitative improvements in the reported clustering, both with and without using edge weights. Furthermore, performance studies indicate near-linear scaling on up to 4K cores of a distributed memory supercomputer.

OBJECTIVES

METHODS

EXPERIMENTAL RESULTS

CONTRIBUTIONS

ACKNOWLEDGMENT and CONTACTS

US Department of Energy DE-SC-0006516
National Science Foundation IIS 0916463
SC’13 Student Volunteer Scholarship
ACM Microsoft Research Travel Award

This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Inna Rytsareva: [email protected]
Ananth Kalyanaraman: [email protected]

Scalable Parallel Algorithms for Clustering Large-Scale Biological Graphs

School of Electrical Engineering and Computer Science, Washington State University Pullman, WA, USA

Inna Rytsareva, Ananth Kalyanaraman (Advisor)

http://www.mkbergman.com/

New variant of the Shingling heuristic based on a normalization technique
◦ Reduction of the number of singletons
◦ Ability to handle unweighted and weighted graphs

MapReduce algorithm for parallelizing the new variant

Extensive experimental results
◦ Qualitative improvement over the standard heuristic
◦ Performance improvement of the MPI implementation over the Hadoop implementation

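[Figure: MapReduce workflow for one shingling phase. The adjacency list of G(Vl, Vr, E), with one record u:{v1, v2, …, vk} per vertex, is read by the mappers, which generate c shingles for u. The output is sorted by shingle ID and passed to the reducers, which build the adjacency list for each shingle. The result is the transformed graph GI(Vs, Vl, E’), which loops back into the next shingling phase.]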
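The map, sort, and reduce steps of one shingling phase can be sketched in a single Python process. This is a minimal illustration and not the poster's MPI-MapReduce or Hadoop code: the function names (`map_vertex`, `shingling_phase`), the hash-based stand-in for a random permutation, and the choice to identify a shingle by its s member vertices are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def _rank(trial, vertex):
    """Pseudo-random rank of a vertex under the permutation used in one trial."""
    digest = hashlib.sha256(f"{trial}:{vertex}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def map_vertex(u, neighbors, s, c):
    """Map step: emit one (shingle ID, u) pair for each of the c trials."""
    distinct = set(neighbors)
    if len(distinct) < s:
        return  # too few neighbors to form a shingle of s elements
    for trial in range(c):
        # The s lowest-ranked neighbors in this trial form the shingle.
        picked = sorted(distinct, key=lambda v: _rank(trial, v))[:s]
        yield tuple(sorted(picked)), u


def shingling_phase(adjacency, s=2, c=3):
    """Map every vertex, sort by shingle ID, then reduce into shingle -> vertices."""
    pairs = [kv for u, nbrs in adjacency.items() for kv in map_vertex(u, nbrs, s, c)]
    pairs.sort(key=lambda kv: kv[0])  # the "sort by shingle ID" step
    transformed = defaultdict(set)
    for shingle_id, u in pairs:  # reduce: collect the vertices behind each shingle
        transformed[shingle_id].add(u)
    return {sid: sorted(members) for sid, members in transformed.items()}


# Toy input (hypothetical): a and b share all neighbors, c has too few.
adjacency = {"a": ["x", "y", "z"], "b": ["x", "y", "z"], "c": ["x"]}
for shingle_id, members in shingling_phase(adjacency).items():
    print(shingle_id, members)
```

The returned dictionary plays the role of the transformed graph: each key is a shingle and each value is its adjacency list, which is the input shape needed to run the same function again for the next phase.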

Characterizing protein families in environmental microbial communities

[Figure: metagenomics analysis pipeline. Box labels: Multiple community data (~350 metagenomics projects as of today); Assemble DNA & predict genes; Translate into ORFs; Already known protein sequences & families; Protein family assignment & discovery; Community function inference. Annotations: ~50 x 10^6 clusters; x 10^8 sequences; x 10^6 to 10^8 new sequences.]

[Figure: clustering workflow. Box labels: n sequences; Remove redundant sequences; All vs. all sequence comparison; Homology graph; Dense subgraphs to clusters; Community annotation.]

Sequence homology graphs: Wu & Kalyanaraman, 2008

Theoretical formulations are NP-Hard (Feige et al., 2001)
◦ Need for efficient heuristics

Metrics for quality evaluation

To allow for overlaps or not

Lack of parallel tools
◦ Developments recent/ongoing (e.g., 10th DIMACS challenge)

Qualitative Assessment

Input: UBC-25M. Modularity and density are quality metrics; the remaining columns are clustering statistics.

Data set, algorithm | Modularity | Density | # non-singleton clusters | % of singletons | Largest cluster size
Unweighted, Standard | 0.8454 | 0.3126±0.2506 | 27,479 | 35.15% | 21,973
Unweighted, Normalized | 0.9928 | 0.5603±0.3866 | 95,505 | 2.39% | 25,827
Weighted, Normalized | 0.9937 | 0.5603±0.3866 | 95,505 | 2.12% | 25,864
Weighted, Louvain | 0.9695 | 0.5309±0.3864 | 116,558 | 0% | 13,297
Weighted, MCL | 0.9383 | 0.5259±0.3837 | 118,518 | 0% | 5,507
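As a reading aid for the modularity column, the sketch below computes the standard Newman-Girvan modularity, Q = Σ_c [ m_c/m − (d_c/2m)^2 ], where m is the total edge weight, m_c the edge weight inside cluster c, and d_c the total weighted degree of c. The poster does not state which formula was used, so this definition is an assumption, and the function name and toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Newman-Girvan modularity of a clustering of an undirected weighted graph.

    edges: iterable of (u, v, weight), each undirected edge listed once.
    cluster_of: dict mapping every vertex to its cluster label.
    """
    total = 0.0
    internal = defaultdict(float)  # edge weight inside each cluster
    degree = defaultdict(float)    # summed weighted degree of each cluster
    for u, v, w in edges:
        total += w
        degree[cluster_of[u]] += w
        degree[cluster_of[v]] += w
        if cluster_of[u] == cluster_of[v]:
            internal[cluster_of[u]] += w
    if total == 0:
        return 0.0
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
two_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(edges, two_clusters))
```

Unweighted graphs are handled by setting every weight to 1, which is how the toy example is written.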

Performance Study

Input: Unweighted UBC subgraphs. Run-times are for our MPI-MapReduce implementation.

Run-time (in seconds) using p cores

Input graph | p=64 | p=128 | p=256 | p=512 | p=1024 | p=2048 | p=4096
UBC-25M | 104.4 | 59.81 | 37.21 | 26.66 | 17.5 | 14.8 | -
UBC-50M | 159.87 | 100.43 | 50.89 | 32.66 | 25.96 | 17.84 | 20.69
UBC-100M | - | 158.22 | 90.74 | 53.51 | 36.48 | 27.88 | 24.86
UBC-200M | - | - | 110.57 | 60.23 | 43.66 | 30.53 | 31.14
UBC-400M | - | - | - | 121.81 | 73.91 | 36.71 | 25.53
UBC-640M | - | - | - | - | 102.49 | 70.78 | 36.08
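The scaling behavior can be checked directly from the reported run-times. The sketch below uses the UBC-50M row, the only one with a value for every core count, and computes speedup and parallel efficiency relative to p=64. Taking p=64 as the baseline is an assumption made for the example (no single-core time is reported), and the function name is hypothetical.

```python
# Run-times in seconds for UBC-50M, copied from the table above.
cores = [64, 128, 256, 512, 1024, 2048, 4096]
runtime_ubc_50m = [159.87, 100.43, 50.89, 32.66, 25.96, 17.84, 20.69]


def relative_speedup(cores, runtimes):
    """Speedup and parallel efficiency relative to the smallest core count."""
    base_p, base_t = cores[0], runtimes[0]
    rows = []
    for p, t in zip(cores, runtimes):
        speedup = base_t / t                # how much faster than the baseline run
        efficiency = speedup / (p / base_p)  # 1.0 would be perfectly linear scaling
        rows.append((p, speedup, efficiency))
    return rows


for p, speedup, efficiency in relative_speedup(cores, runtime_ubc_50m):
    print(f"p={p:5d}  speedup={speedup:5.2f}  efficiency={efficiency:6.1%}")
```

The same function applies to the other rows once the core counts are restricted to those for which a run-time is reported.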

Performance observations

MapReduce implementation matters: the MPI-MapReduce implementation is at least two orders of magnitude faster than Hadoop’s

Unweighted vs. weighted analysis takes about the same time

Adjacency list implementation is about 3x faster than edge list implementation

Experimental Setup

Experimental platform: Hopper (NERSC-6), a Cray XE6
◦ 6,392 compute nodes, 153,216 cores
◦ 32 GB RAM per node
◦ Custom mpich-2 version 5.5.5 for Cray XE
◦ MapReduce-MPI library

Input label | Number of edges | Number of vertices
UBC-25M | 25 x 10^6 | 3,965 x 10^3
UBC-50M | 50 x 10^6 | 4,525 x 10^3
UBC-100M | 100 x 10^6 | 6,795 x 10^3
UBC-200M | 200 x 10^6 | 7,336 x 10^3
UBC-400M | 400 x 10^6 | 8,958 x 10^3
UBC-640M | 640 x 10^6 | 10,346 x 10^3


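[Figure: the Shingling heuristic. The adjacency lists of two vertices u and v are each randomly permuted, and s elements of each permuted list form a shingle. If the two shingles are the same, then u and v are related. The procedure is repeated for c trials. s, c: parameters.]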
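The pairwise test in the figure can be written out as a short Python sketch. It assumes the min-wise scheme of reference 1: in each trial a single random ordering of the whole vertex set is applied to both adjacency lists, and the s lowest-ranked neighbors form the shingle. The function names, the seed handling, and the rule that any shared shingle makes two vertices related are assumptions made for illustration.

```python
import random


def trial_permutations(vertices, c, seed=0):
    """One random ordering (vertex -> rank) of the whole vertex set per trial."""
    rng = random.Random(seed)
    perms = []
    for _ in range(c):
        order = list(vertices)
        rng.shuffle(order)
        perms.append({v: rank for rank, v in enumerate(order)})
    return perms


def shingles(neighbors, perms, s):
    """Shingles of one adjacency list: its s lowest-ranked neighbors in each trial."""
    if len(neighbors) < s:
        return []  # low-degree vertex: no shingle can be formed
    return [frozenset(sorted(neighbors, key=perm.__getitem__)[:s]) for perm in perms]


def related(adj_u, adj_v, perms, s):
    """u and v are related if they have at least one shingle in common."""
    return bool(set(shingles(adj_u, perms, s)) & set(shingles(adj_v, perms, s)))


# Toy example (hypothetical): vertices 0..9, s=2, c=5.
perms = trial_permutations(range(10), c=5, seed=1)
print(related([0, 1, 2, 3], [0, 1, 2, 3], perms, s=2))  # identical neighborhoods
print(related([0, 1, 2, 3], [4, 5, 6, 7], perms, s=2))  # disjoint neighborhoods
```

A vertex with fewer than s neighbors returns no shingles at all in this sketch, which is one way a low-degree vertex can end up as a singleton.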

Large number of singletons
◦ Cause: low-degree vertices tend to be left out by the Shingling sampling approach
◦ Two contrasting scenarios for low-degree vertices: (+)ve case: periphery; (-)ve case: chimeric

Inability to handle edge weights (e.g., degree of sequence similarity)

Scaling to very large data sets
◦ Parallelization required for reducing time to solution

Step 1) Normalize edge weight for every edge (if not already normalized)

$$
w_{u,v_i} = \frac{w_{u,v_i}}{w_{u,v_1} + \dots + w_{u,v_i} + \dots + w_{u,v_n}} \cdot C
$$

C – normalizing constant

Step 2) Generate $w_{u,v_i}$ copies for edge $(u, v_i)$

Perform shingling on the multigraph
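Steps 1 and 2 can be sketched in a few lines of Python. The poster does not say how a fractional normalized weight is turned into a whole number of edge copies, so rounding to the nearest integer is an assumption here, as are the function names and the value of C in the example.

```python
def normalize_weights(weighted_adj, C=10):
    """Step 1: scale the edge weights of each vertex so that they sum to C."""
    normalized = {}
    for u, nbrs in weighted_adj.items():
        total = sum(nbrs.values())
        normalized[u] = {v: w / total * C for v, w in nbrs.items()} if total else {}
    return normalized


def expand_to_multigraph(normalized_adj):
    """Step 2: list neighbor v once per copy of edge (u, v).

    Assumption: the number of copies is the normalized weight rounded to the
    nearest integer, so an edge whose weight rounds to 0 is dropped.
    """
    return {
        u: [v for v, w in nbrs.items() for _ in range(round(w))]
        for u, nbrs in normalized_adj.items()
    }


# Toy input (hypothetical): u has a strong edge to a and a weak edge to b.
weighted_adj = {"u": {"a": 3.0, "b": 1.0}}
multigraph = expand_to_multigraph(normalize_weights(weighted_adj, C=8))
print(multigraph)
```

For an unweighted input every weight is 1, so Step 1 assigns each edge of a degree-d vertex the weight C/d before the copies are generated.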

http://lamar.colostate.edu/~jvivanco/interactions.htm http://www.wageningenur.nl/en/show/Isolation-and-characterization-of-novel-and-shortchain-fatty-acid-producing-bacteria-in-the-anaerobic-human-gut..htm

http://www.nersc.gov/assets/About-Us/hopper1.jpg

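[Figure: the two shingling passes. (1) 1st pass: the input graph G(V, V, E) is shingled to produce the 1st-level shingles, giving GI(Vs, V, E). (2) 2nd pass: GI is shingled to produce the 2nd-level shingles, giving GII(Vs, Vt, E). (3) Dense subgraphs (A≅B) are reported. The illustrated examples are labeled loose and tight.]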

REFERENCES

1. Broder, A.Z. et al. (2000), ‘Min-wise independent permutations’, Journal of Computer and System Sciences, Vol. 60, pp.630–659.
2. Gibson, D., Kumar, R. and Tomkins, A. (2005), ‘Discovering large dense subgraphs in massive graphs’, Proceedings of the International Conference on Very Large Data Bases, pp.721–732.
3. Rytsareva, I. and Kalyanaraman, A. (2013), ‘Scalable heuristics for clustering biological graphs’, 3rd IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pp.1–6.
4. Wu, C. and Kalyanaraman, A. (2008), ‘An efficient parallel approach for identifying protein families in large-scale metagenomic data sets’, Proceedings of the ACM/IEEE Conference on Supercomputing, pp.1–10.

SC’13 ACM Student Research Competition, Denver, CO, 2013
