Scalable Parallel Algorithms for Clustering Large-Scale Biological Graphs



Page 1: Scalable Parallel Algorithms for Clustering Large-Scale Biological Graphs (sc13.supercomputing.org/sites/default/files/Posters...)
Inna Rytsareva: inna.rytsareva@email.wsu.edu
Ananth Kalyanaraman: ananth@eecs.wsu.edu


ABSTRACT

Clustering is an advanced analytical function with immense potential to transform the knowledge space, particularly when applied to a data-rich domain such as computational biology. The computational hardness of the underlying theoretical problem necessitates the use of heuristics in practice. In this study, we evaluate the application of a randomized sampling-based heuristic called shingling on unweighted biological graphs and propose a new variant of this heuristic that extends its application to weighted graph inputs and is better positioned to achieve qualitative gains on unweighted inputs as well. We also present parallel algorithms for this heuristic using the MapReduce paradigm. Experimental results on subsets of a medium-scale real-world input (containing up to 10.3M vertices and 640M edges) demonstrate significant qualitative improvements in the reported clustering, both with and without using edge weights. Furthermore, performance studies indicate near-linear scaling on up to 4K cores of a distributed-memory supercomputer.

OBJECTIVES

METHODS

EXPERIMENTAL RESULTS CONTRIBUTIONS

ACKNOWLEDGMENT and CONTACTS

US Department of Energy DE-SC-0006516
National Science Foundation IIS 0916463
SC’13 Student Volunteer Scholarship
ACM Microsoft Research Travel Award

This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Inna Rytsareva: inna.rytsareva@email.wsu.edu
Ananth Kalyanaraman: ananth@eecs.wsu.edu

Scalable Parallel Algorithms for Clustering Large-Scale Biological Graphs

School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA

Inna Rytsareva, Ananth Kalyanaraman (Advisor)

http://www.mkbergman.com/

- New variant of the Shingling heuristic based on a normalization technique
- Reduction of the number of singletons
- Ability to handle unweighted and weighted graphs
- MapReduce algorithm for parallelizing the new variant
- Extensive experimental results
  - qualitative improvement over the standard heuristic
  - performance improvement of the MPI implementation over the Hadoop implementation

[Figure: MapReduce workflow for one shingling phase. Labels: adjacency list of G(Vl, Vr, E) with records u:{v1, v2, … vk}; Mappers; c shingles for u; sort by shingle ID; Reducers; adjacency list for each shingle; transformed graph GI(Vs, Vl, E’); loop back to next shingling phase.]
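The workflow labels above (mappers emitting c shingles per adjacency-list record, a sort by shingle ID, and reducers collecting the adjacency list for each shingle) can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions, not the authors' MPI or Hadoop code: the function names, the linear hash used in place of a random permutation, and the use of the sorted vertex tuple as the shingle ID are all hypothetical choices.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # modulus for the illustrative hash functions (assumption)


def make_trials(c, seed=0):
    """Draw c (a, b) pairs; each defines one hash standing in for a random permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(c)]


def map_vertex(u, neighbors, s, trials):
    """Mapper: emit one (shingle_id, u) pair per trial for the record u:{v1, ..., vk}."""
    if len(neighbors) < s:
        return  # too few neighbors to form an s-element shingle
    for a, b in trials:
        ranked = sorted(neighbors, key=lambda v: (a * v + b) % PRIME)
        yield tuple(sorted(ranked[:s])), u  # shingle ID = its s vertices (assumption)


def shingling_pass(adjacency, s, c, seed=0):
    """One shingling phase: map, sort by shingle ID, reduce into shingle -> vertices."""
    trials = make_trials(c, seed)
    emitted = [pair
               for u, neighbors in adjacency.items()
               for pair in map_vertex(u, neighbors, s, trials)]
    emitted.sort()  # stands in for the shuffle/sort step between mappers and reducers
    transformed = defaultdict(set)
    for shingle_id, u in emitted:  # reducer: adjacency list for each shingle
        transformed[shingle_id].add(u)
    return dict(transformed)


# Vertices 1 and 2 share all three neighbors; vertex 3 has too few to form a shingle.
example = shingling_pass({1: [10, 11, 12], 2: [10, 11, 12], 3: [10]}, s=3, c=2)
print(example)
```

The returned dictionary plays the role of the transformed graph and could be fed back in as the adjacency list for the next phase, provided the shingle IDs are first mapped to integers.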

Characterizing protein families in environmental microbial communities

[Figure: metagenomics analysis workflow. Labels: Multiple community data (~350 metagenomics projects, as of today); Assemble DNA & Predict genes; Translate into ORFs; n sequences; Remove redundant sequences; All vs. All sequence comparison; Homology Graph; Dense subgraphs to clusters; Already known protein seq. & families; Protein family assignment & discovery; Community annotation; Community function inference. Scale annotations: ~50 x 10^6 clusters; x 10^8 sequences; x 10^6-8 new sequences.]

Sequence homology graphs: Wu & Kalyanaraman, 2008

- Theoretical formulations are NP-Hard (Feige et al., 2001)
  - Need for efficient heuristics
- Metrics for quality evaluation
- To allow for overlaps or not
- Lack of parallel tools
  - Developments recent/ongoing (e.g., 10th DIMACS challenge)

Qualitative Assessment

Input: UBC-25M

                          Quality metrics              Clustering statistics
Data set, Algorithm       Modularity   Density         # non-singleton clusters   % of singletons   Largest cluster size
Unweighted, Standard      0.8454       0.3126±0.2506   27,479                     35.15%            21,973
Unweighted, Normalized    0.9928       0.5603±0.3866   95,505                     2.39%             25,827
Weighted, Normalized      0.9937       0.5603±0.3866   95,505                     2.12%             25,864
Weighted, Louvain         0.9695       0.5309±0.3864   116,558                    0%                13,297
Weighted, MCL             0.9383       0.5259±0.3837   118,518                    0%                5,507

Performance Study

Input: Unweighted UBC subgraphs. Run-times are for our MPI-MapReduce implementation.

Run-time (in seconds) using p cores

Input graph p=64 p=128 p=256 p=512 p=1024 p=2048 p=4096

UBC-25M 104.4 59.81 37.21 26.66 17.5 14.8

UBC-50M 159.87 100.43 50.89 32.66 25.96 17.84 20.69

UBC-100M 158.22 90.74 53.51 36.48 27.88 24.86

UBC-200M 110.57 60.23 43.66 30.53 31.14

UBC-400M 121.81 73.91 36.71 25.53

UBC-640M 102.49 70.78 36.08

Performance observations

- MapReduce implementation matters: the MPI-MapReduce implementation is at least two orders of magnitude faster than the Hadoop implementation
- Unweighted and weighted analyses take about the same time
- The adjacency list implementation is about 3x faster than the edge list implementation
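As a worked example of reading the run-time table, the sketch below computes relative speedup and parallel efficiency for UBC-50M, the only row that reports a value for every core count and so the only one whose column positions are unambiguous in this transcript. The baseline of p=64 and the function name `relative_scaling` are choices made for illustration; the poster does not define how it measures scaling.

```python
# Run-times (seconds) for UBC-50M, copied from the Performance Study table.
cores = [64, 128, 256, 512, 1024, 2048, 4096]
runtimes_ubc_50m = [159.87, 100.43, 50.89, 32.66, 25.96, 17.84, 20.69]


def relative_scaling(cores, runtimes):
    """Return (p, speedup, efficiency) tuples relative to the smallest core count."""
    base_p, base_t = cores[0], runtimes[0]
    rows = []
    for p, t in zip(cores, runtimes):
        speedup = base_t / t                   # how much faster than the baseline run
        efficiency = speedup / (p / base_p)    # 1.0 would be ideal linear scaling
        rows.append((p, speedup, efficiency))
    return rows


for p, speedup, efficiency in relative_scaling(cores, runtimes_ubc_50m):
    print(f"p={p:5d}  speedup={speedup:6.2f}  efficiency={efficiency:5.2f}")
```

For this input the run-time at p=4096 (20.69 s) is higher than at p=2048 (17.84 s), so the relative speedup drops at the largest core count. The poster does not state the cause.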

Experimental Setup

Experimental platform:
- Hopper (NERSC-6): Cray XE6
  - 6,392 compute nodes, 153,216 cores
  - 32 GB RAM per node
  - Custom mpich-2 version 5.5.5 for Cray XE
  - MapReduce-MPI library

Input label   Number of edges   Number of vertices
UBC-25M       25 x 10^6         3,965 x 10^3
UBC-50M       50 x 10^6         4,525 x 10^3
UBC-100M      100 x 10^6        6,795 x 10^3
UBC-200M      200 x 10^6        7,336 x 10^3
UBC-400M      400 x 10^6        8,958 x 10^3
UBC-640M      640 x 10^6        10,346 x 10^3


[Figure: the shingling heuristic, with parameters s and c. The neighbor lists of vertices u and v are each randomly permuted and s elements are taken from each as a shingle. If the shingles are the same, then u and v are related. Repeated for c trials.]

- Large number of singletons
- Inability to handle edge weights
  - e.g., degree of sequence similarity
- Scaling to very large data sets
  - Parallelization required for reducing time to solution

Cause: Low-degree vertices tend to be left out by the Shingling sampling approach.

Two contrasting scenarios for low-degree vertices:
- (+)ve case: Periphery
- (-)ve case: Chimeric
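The (s, c) shingling test described above, and the way it leaves out low-degree vertices, can be illustrated with a small Python sketch. It assumes the min-wise formulation of Broder et al. and Gibson et al., where each trial keeps the s neighbors that rank lowest under a shared random permutation; the function names and the choice to represent a shingle as a `frozenset` are hypothetical.

```python
import random


def random_rankings(universe, c, seed=0):
    """Build c random permutations of the vertex universe, stored as vertex -> rank maps."""
    rng = random.Random(seed)
    rankings = []
    for _ in range(c):
        order = sorted(universe)
        rng.shuffle(order)
        rankings.append({v: i for i, v in enumerate(order)})
    return rankings


def shingles(neighbors, s, rankings):
    """Return the set of shingles of one vertex: one s-element shingle per trial."""
    if len(neighbors) < s:
        return set()  # a low-degree vertex cannot form a shingle and is left out
    result = set()
    for rank in rankings:
        smallest = sorted(neighbors, key=rank.__getitem__)[:s]
        result.add(frozenset(smallest))
    return result


def related(neighbors_u, neighbors_v, s, rankings):
    """u and v are related if any of their shingles coincide."""
    return bool(shingles(neighbors_u, s, rankings) & shingles(neighbors_v, s, rankings))


rankings = random_rankings(range(20), c=5, seed=1)
print(related({1, 2, 3, 4}, {1, 2, 3, 4}, s=2, rankings=rankings))  # identical neighborhoods
print(shingles({7}, s=2, rankings=rankings))                        # degree 1: no shingles
```

Because both vertices are ranked by the same permutation in each trial, heavily overlapping neighborhoods are likely to produce a common shingle, while a vertex with fewer than s neighbors produces none at all.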

Step 1) Normalize the edge weight for every edge (if not already normalized):

$$ w_{u,v_i} = \frac{w_{u,v_i}}{w_{u,v_1} + \dots + w_{u,v_i} + \dots + w_{u,v_n}} \cdot C $$

where C is a normalizing constant.

Step 2) Generate w_{u,v_i} copies of edge (u, v_i).

Perform shingling on the multigraph.
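A minimal sketch of the two steps, assuming strictly positive weights and a dictionary-of-dictionaries graph representation. The poster does not say how fractional copy counts are handled, so rounding to the nearest integer with a floor of one copy is an assumption made here, as are the function names and the value C = 8 in the example.

```python
def normalize_weights(weighted_adj, C):
    """Step 1: scale each vertex's outgoing weights so they sum to C (weights assumed > 0)."""
    normalized = {}
    for u, neighbors in weighted_adj.items():
        total = sum(neighbors.values())
        normalized[u] = {v: w / total * C for v, w in neighbors.items()}
    return normalized


def expand_to_multigraph(normalized, min_copies=1):
    """Step 2: repeat each neighbor once per copy of its edge.

    Rounding and the min_copies floor are assumptions; the poster leaves them unspecified.
    """
    multigraph = {}
    for u, neighbors in normalized.items():
        copies = []
        for v, w in neighbors.items():
            copies.extend([v] * max(min_copies, round(w)))
        multigraph[u] = copies
    return multigraph


# Vertex 0 has two weighted edges; vertex 3 has a single edge of weight 1.
graph = {0: {1: 3.0, 2: 1.0}, 3: {0: 1.0}}
normalized = normalize_weights(graph, C=8)
multigraph = expand_to_multigraph(normalized)
print(normalized)
print({u: len(copies) for u, copies in multigraph.items()})
```

After normalization every vertex contributes the same total number of edge copies, C, regardless of its original degree. How the copies of one edge are told apart during shingling is not described in the poster and is left out of this sketch.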

http://lamar.colostate.edu/~jvivanco/interactions.htm

http://www.wageningenur.nl/en/show/Isolation-and-characterization-of-novel-and-shortchain-fatty-acid-producing-bacteria-in-the-anaerobic-human-gut..htm

http://www.nersc.gov/assets/About-Us/hopper1.jpg

[Figure: multi-pass shingling. Labels: (1) 1st pass on G(V, V, E), 1st shingles; (2) 2nd pass on GI(Vs, V, E), 2nd shingles; (3) dense subgraphs, A≅B, GII(Vs, Vt, E); loose, tight.]

REFERENCES

1. Broder, A.Z. et al. (2000), ‘Min-wise independent permutations’, Journal of Computer and System Sciences, Vol. 60, pp.630–659.
2. Gibson, D., Kumar, R. and Tomkins, A. (2005), ‘Discovering large dense subgraphs in massive graphs’, Proceedings of the International Conference on Very Large Data Bases, pp.721–732.
3. Rytsareva, I. and Kalyanaraman, A. (2013), ‘Scalable heuristics for clustering biological graphs’, 3rd IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pp.1–6.
4. Wu, C. and Kalyanaraman, A. (2008), ‘An efficient parallel approach for identifying protein families in large-scale metagenomic data sets’, Proceedings ACM/IEEE conference on Supercomputing, pp.1–10.

SC’13 ACM Student Research Competition, Denver, CO, 2013
