DisCo: Distributed Co-clustering with Map-Reduce

Spiros Papadimitriou, Jimeng Sun
IBM T.J. Watson Research Center, Hawthorne, NY, USA
Reporter: Nai-Hui Ku


Page 1: DisCo : Distributed Co-clustering with Map-Reduce

DISCO: DISTRIBUTED CO-CLUSTERING WITH MAP-REDUCE

Spiros Papadimitriou, Jimeng Sun
IBM T.J. Watson Research Center, Hawthorne, NY, USA
Reporter: Nai-Hui Ku

Page 2: DisCo : Distributed Co-clustering with Map-Reduce

OUTLINE
- Introduction
- Related Work
- Distributed Mining Process
- Co-clustering Huge Datasets
- Experiments
- Conclusions

Page 3: DisCo : Distributed Co-clustering with Map-Reduce

INTRODUCTION
Problems
- Huge datasets
- Natural sources of data are in an impure form

Proposed Method
- A comprehensive Distributed Co-clustering (DisCo) solution
- Using Hadoop
- DisCo is a scalable framework under which various co-clustering algorithms can be implemented

Page 4: DisCo : Distributed Co-clustering with Map-Reduce

RELATED WORK
Map-Reduce framework
- Employs a distributed storage cluster
- Block-addressable storage
- A centralized metadata server
- Convenient data access
- Storage API for Map-Reduce tasks

Page 5: DisCo : Distributed Co-clustering with Map-Reduce

RELATED WORK
Co-clustering
- Algorithm cluster shapes:
  - Checkerboard partitions
  - Single bi-cluster
  - Exclusive row and column partitions
  - Overlapping partitions
- Optimization criteria: code length

Page 6: DisCo : Distributed Co-clustering with Map-Reduce

DISTRIBUTED MINING PROCESS
- Identifying the source and obtaining the data
- Transforming raw data into the appropriate format for data analysis
- Visualizing the results, or turning them into the input for other applications

Page 7: DisCo : Distributed Co-clustering with Map-Reduce

DISTRIBUTED MINING PROCESS (CONT.)
Data pre-processing
- Processing a 350 GB raw network event log needs over 5 hours to extract source/destination IP pairs
- Much better performance is achieved on a few commodity nodes running Hadoop (a sketch of this extraction step follows)
- Setting up Hadoop required minimal effort
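
The extraction code itself is not part of the transcript. Below is a minimal plain-Java sketch, not taken from the slides, of the per-record work each Hadoop map task could perform for this step; the log format and field positions are hypothetical.

    // Minimal sketch (not from the slides) of per-record source/destination IP
    // extraction, the kind of work each Hadoop map task would do in parallel.
    // The field positions below are hypothetical.
    import java.util.AbstractMap.SimpleEntry;
    import java.util.Map;

    public class IpPairExtractor {

        // Assume, hypothetically, a whitespace-separated log line whose 3rd and
        // 4th fields are the source and destination IP addresses.
        public static Map.Entry<String, String> extract(String logLine) {
            String[] fields = logLine.trim().split("\\s+");
            if (fields.length < 4) {
                return null; // malformed record: a map task would simply skip it
            }
            return new SimpleEntry<>(fields[2], fields[3]);
        }

        public static void main(String[] args) {
            String line = "2008-01-01T00:00:00 ALERT 192.168.0.1 10.0.0.7 scan";
            System.out.println(extract(line)); // 192.168.0.1=10.0.0.7
        }
    }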

Page 8: DisCo : Distributed Co-clustering with Map-Reduce

DISTRIBUTED MINING PROCESS (CONT.)

Specifically for co-clustering, there are two main pre-processing tasks:
- Building the graph from raw data
- Pre-computing the transpose

During co-clustering optimization, we need to iterate over both rows and columns, so the adjacency lists for both the original graph and its transpose must be pre-computed (see the sketch below).
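
The transcript does not show how the transpose is pre-computed. As a rough illustration, not from the slides, a single pass over the nonzero entries can emit each edge under both its row key and its column key, producing adjacency lists for A and for its transpose; the plain-Java sketch below mimics that idea in memory.

    // Minimal sketch (not from the slides): building adjacency lists for a sparse
    // matrix A and for its transpose in one pass over the nonzero entries, which
    // is what a single map-reduce job could do by emitting each entry twice.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class AdjacencyBuilder {

        public static void main(String[] args) {
            // Nonzero entries of A as (row, column) pairs.
            int[][] entries = {{0, 1}, {0, 2}, {2, 1}};

            Map<Integer, List<Integer>> rowAdj = new HashMap<>(); // adjacency of A
            Map<Integer, List<Integer>> colAdj = new HashMap<>(); // adjacency of A's transpose

            for (int[] e : entries) {
                int i = e[0], j = e[1];
                // "map": emit (i -> j) for the original graph and (j -> i) for the transpose
                rowAdj.computeIfAbsent(i, k -> new ArrayList<>()).add(j);
                colAdj.computeIfAbsent(j, k -> new ArrayList<>()).add(i);
            }

            System.out.println("row 0: " + rowAdj.get(0));    // [1, 2]
            System.out.println("column 1: " + colAdj.get(1)); // [0, 2]
        }
    }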

Page 9: DisCo : Distributed Co-clustering with Map-Reduce

CO-CLUSTERING HUGE DATASETS
Definitions and overview
- Matrices are denoted by boldface capital letters; vectors are denoted by boldface lowercase letters
- a_ij: the (i, j)-th element of matrix A
- Co-clustering algorithms employ a checkerboard pattern: the original adjacency matrix is partitioned into a grid of sub-matrices
- For an m x n matrix A, a co-clustering is a pair of row and column labeling vectors
- r(i): the group label assigned to the i-th row of the matrix
- G: the k x ℓ group matrix

Page 10: DisCo : Distributed Co-clustering with Map-Reduce

CO-CLUSTERING HUGE DATASETS (CONT.)

g_pq gives the sufficient statistics for the (p, q) sub-matrix (for example, the sum of the entries a_ij with r(i) = p and c(j) = q); a small sketch of this computation follows.
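
As an illustration only (not from the slides), the group matrix can be accumulated in one sweep over A; here block sums stand in for whatever sufficient statistics the chosen criterion actually needs.

    // Illustrative sketch: computing the k x l group matrix G from a dense
    // matrix A and labeling vectors r, c, where g[p][q] accumulates the entries
    // a[i][j] with r[i] = p and c[j] = q (block sums as sufficient statistics).
    public class GroupMatrix {

        public static double[][] compute(double[][] a, int[] r, int[] c, int k, int l) {
            double[][] g = new double[k][l];
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < a[i].length; j++) {
                    g[r[i]][c[j]] += a[i][j];
                }
            }
            return g;
        }

        public static void main(String[] args) {
            double[][] a = {{1, 0, 1}, {0, 1, 0}};
            int[] r = {0, 1};    // row labels
            int[] c = {0, 1, 0}; // column labels
            double[][] g = compute(a, r, c, 2, 2);
            System.out.println(g[0][0] + " " + g[0][1]); // 2.0 0.0
            System.out.println(g[1][0] + " " + g[1][1]); // 0.0 1.0
        }
    }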

Page 11: DisCo : Distributed Co-clustering with Map-Reduce

CO-CLUSTERING HUGE DATASETS (CONT.)
Map function
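
The map function shown on this slide is not captured in the transcript. The plain-Java sketch below only suggests the general shape such a per-row map step could take: tally the row's nonzeros per column group against the current column labels, then pick a row label. The cost function is a deliberately simplified placeholder, not the paper's criterion, and the names (mapRow, cost) are invented for illustration.

    // Illustrative sketch of one row's work in a co-clustering map step.
    import java.util.List;

    public class CoclusterMapStep {

        // Placeholder cost of assigning a row with per-column-group counts
        // rowStats to row group p, judged against the current group matrix g.
        // The real criterion (e.g., code length) is not shown in the transcript.
        static double cost(long[] rowStats, double[][] g, int p) {
            double c = 0.0;
            for (int q = 0; q < rowStats.length; q++) {
                c -= rowStats[q] * Math.log(g[p][q] + 1.0);
            }
            return c;
        }

        // Tally the row's nonzeros per column group, then pick the row group with
        // the lowest cost. A map task would emit (chosen label) -> (row id, rowStats)
        // for the reducers to aggregate.
        public static int mapRow(List<Integer> rowAdjacency, int[] colLabels, double[][] g) {
            long[] rowStats = new long[g[0].length];
            for (int j : rowAdjacency) {
                rowStats[colLabels[j]]++;
            }
            int best = 0;
            for (int p = 1; p < g.length; p++) {
                if (cost(rowStats, g, p) < cost(rowStats, g, best)) best = p;
            }
            return best;
        }

        public static void main(String[] args) {
            double[][] g = {{5, 0}, {0, 5}}; // current group matrix
            int[] colLabels = {0, 0, 1};     // current column labels
            // This row touches columns 0 and 1 (both in column group 0).
            System.out.println(mapRow(List.of(0, 1), colLabels, g)); // prints 0
        }
    }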

Page 12: DisCo : Distributed Co-clustering with Map-Reduce

CO-CLUSTERING HUGE DATASETS (CONT.)
Reduce function
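
The reduce function is likewise not captured in the transcript. Below is a minimal sketch of the aggregation a reducer could perform, assuming all rows assigned to one row group arrive at the same reducer; reduceGroup is an invented name.

    // Illustrative sketch of the reduce side: rows assigned to the same row
    // group p arrive at one reducer, which sums their per-column-group counts
    // into row p of the new group matrix.
    import java.util.List;

    public class CoclusterReduceStep {

        // values: for row group p, the per-column-group counts of each member row.
        public static long[] reduceGroup(List<long[]> values, int numColGroups) {
            long[] gRow = new long[numColGroups];
            for (long[] rowStats : values) {
                for (int q = 0; q < numColGroups; q++) {
                    gRow[q] += rowStats[q];
                }
            }
            return gRow; // becomes row p of the updated group matrix G
        }

        public static void main(String[] args) {
            List<long[]> values = List.of(new long[]{2, 0}, new long[]{1, 3});
            long[] gRow = reduceGroup(values, 2);
            System.out.println(gRow[0] + " " + gRow[1]); // 3 3
        }
    }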

Page 13: DisCo : Distributed Co-clustering with Map-Reduce

CO-CLUSTERING HUGE DATASETS (CONT.)
Global sync
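
The global synchronization step is also not captured in the transcript. The sketch below illustrates one plausible version, purely as an assumption: partial group-matrix rows returned by the reducers are merged into a single k x ℓ matrix, which is then shipped to every node before the next pass (e.g., the column pass over the transposed graph).

    // Illustrative sketch (not from the slides) of global synchronization:
    // merge the per-row-group results produced by the reducers into one group
    // matrix, which is then distributed to all nodes for the next iteration.
    import java.util.Map;

    public class GlobalSync {

        // reducerOutput maps a row-group index p to the freshly aggregated row p of G.
        public static long[][] merge(Map<Integer, long[]> reducerOutput, int k, int l) {
            long[][] g = new long[k][l];
            for (Map.Entry<Integer, long[]> e : reducerOutput.entrySet()) {
                g[e.getKey()] = e.getValue();
            }
            return g;
        }

        public static void main(String[] args) {
            Map<Integer, long[]> out = Map.of(0, new long[]{3, 3}, 1, new long[]{0, 4});
            long[][] g = merge(out, 2, 2);
            System.out.println(g[1][1]); // 4
        }
    }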

Page 14: DisCo : Distributed Co-clustering with Map-Reduce

EXPERIMENTS
Setup
- 39 nodes
- Two dual-core processors, 8GB RAM, Linux RHEL4 per node
- 4 Gbps Ethernet
- SATA disks, 65MB/sec or roughly 500 Mbps
- The total capacity of our HDFS cluster was just 2.4 terabytes
- HDFS block size was set to 64MB (default value)
- Java: Sun JDK version 1.6.0_03

Page 15: DisCo : Distributed Co-clustering with Map-Reduce

EXPERIMENTS (CONT.)
The pre-processing step on the ISS data
Default values:
- 39 nodes
- 6 concurrent maps per node
- 5 reduce tasks
- 256MB input split size

Page 16: DisCo : Distributed Co-clustering with Map-Reduce

EXPERIMENTS (CONT.)

Page 17: DisCo : Distributed Co-clustering with Map-Reduce

CONCLUSIONS
- Using relatively low-cost components, the system achieves I/O rates that exceed those of high-performance storage systems.
- Performance scales almost linearly with the number of machines/disks.