DisCo: Distributed Co-clustering with Map-Reduce
DISCO: DISTRIBUTED CO-CLUSTERING WITH MAP-REDUCE
Spiros Papadimitriou, Jimeng Sun
IBM T.J. Watson Research Center, Hawthorne, NY, USA
Reporter: Nai-Hui Ku
OUTLINE
Introduction
Related Work
Distributed Mining Process
Co-clustering Huge Datasets
Experiments
Conclusions
INTRODUCTION
Problems:
Huge datasets
Natural sources of data are in an impure form
Proposed method:
A comprehensive Distributed Co-clustering (DisCo) solution using Hadoop
DisCo is a scalable framework under which various co-clustering algorithms can be implemented
RELATED WORK
Map-Reduce framework:
Employs a distributed storage cluster
Block-addressable storage
A centralized metadata server
Convenient data access and a storage API for Map-Reduce tasks
RELATED WORK
Co-clustering:
Cluster shapes: checkerboard partitions, single bi-cluster, exclusive row and column partitions, overlapping partitions
Optimization criteria: e.g., code length
DISTRIBUTED MINING PROCESS
Identify the source and obtain the data
Transform the raw data into the appropriate format for data analysis
Visualize the results, or turn them into input for other applications
DISTRIBUTED MINING PROCESS (CONT.)
Data pre-processing:
Extracting source/destination IP pairs from a 350 GB raw network event log needed over 5 hours
Much better performance was achieved on a few commodity nodes running Hadoop
Setting up Hadoop required minimal effort
DISTRIBUTED MINING PROCESS (CONT.)
Specifically for co-clustering, there are two main pre-processing tasks: building the graph from the raw data, and pre-computing its transpose.
During co-clustering optimization, we need to iterate over both rows and columns, so we need to pre-compute the adjacency lists of both the original graph and its transpose.
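The transpose can itself be pre-computed in one Map-Reduce pass. Below is a minimal Python sketch (not the paper's actual Hadoop/Java code); a local driver stands in for Hadoop's shuffle phase:

```python
from collections import defaultdict

def map_transpose(src, dst_list):
    """Map: for each outgoing edge (src -> dst), emit (dst, src),
    regrouping edges by destination node."""
    for dst in dst_list:
        yield dst, src

def reduce_transpose(dst, srcs):
    """Reduce: collect all sources pointing at dst, i.e. the adjacency
    list of dst in the transposed graph."""
    return dst, sorted(srcs)

def transpose(adj):
    """Driver simulating the shuffle: group map outputs by key,
    then reduce each group."""
    groups = defaultdict(list)
    for src, dsts in adj.items():
        for key, val in map_transpose(src, dsts):
            groups[key].append(val)
    return dict(reduce_transpose(d, s) for d, s in groups.items())

adj = {"a": ["b", "c"], "b": ["c"]}
print(transpose(adj))  # {'b': ['a'], 'c': ['a', 'b']}
```

On a real cluster the grouping is done by Hadoop's shuffle, so only the map and reduce functions need to be supplied.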
CO-CLUSTERING HUGE DATASETS
Definitions and overview:
Matrices are denoted by boldface capital letters; vectors by boldface lowercase letters
aij: the (i, j)-th element of matrix A
Co-clustering algorithms employ a checkerboard pattern: the original adjacency matrix is partitioned into a grid of sub-matrices
For an m × n matrix, a co-clustering is a pair of row and column labeling vectors
r(i): the group label of the i-th row of the matrix
G: the k × ℓ group matrix
CO-CLUSTERING HUGE DATASETS (CONT.)
The element g_pq gives the sufficient statistics for the (p, q) sub-matrix
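As a concrete illustration, here is a minimal NumPy sketch of computing the group matrix G from the labeling vectors, assuming the sum of entries per sub-matrix is the sufficient statistic (as for binary adjacency data; other objectives use other statistics):

```python
import numpy as np

def group_matrix(A, r, c, k, ell):
    """Compute the k x ell group matrix G, where G[p, q] sums the entries
    of the (p, q) sub-matrix induced by row labels r and column labels c."""
    G = np.zeros((k, ell))
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            G[r[i], c[j]] += A[i, j]
    return G

A = np.array([[1, 0, 1],
              [0, 1, 0]])
r = [0, 1]      # row group labels
c = [0, 0, 1]   # column group labels
print(group_matrix(A, r, c, k=2, ell=2))
# [[1. 1.]
#  [1. 0.]]
```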
CO-CLUSTERING HUGE DATASETS (CONT.) Map function
CO-CLUSTERING HUGE DATASETS (CONT.) Reduce function
CO-CLUSTERING HUGE DATASETS (CONT.) Global sync
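The three slides above only name the map, reduce, and global-sync steps. The following Python sketch shows one row-assignment iteration under an illustrative squared-error cost against block means (the actual cost depends on the chosen co-clustering objective, and this is not the paper's code):

```python
import numpy as np

def map_assign_row(row, c, means):
    """Map: choose the row group p minimizing squared error between the
    row's entries and the current block means (illustrative criterion)."""
    k = means.shape[0]
    costs = [sum((row[j] - means[p, c[j]]) ** 2 for j in range(len(row)))
             for p in range(k)]
    return int(np.argmin(costs))

def reduce_stats(assignments, A, c, k, ell):
    """Reduce: accumulate per-(row group, column group) sums and counts,
    the sufficient statistics for this cost."""
    sums = np.zeros((k, ell))
    counts = np.zeros((k, ell))
    for i, p in enumerate(assignments):
        for j in range(A.shape[1]):
            sums[p, c[j]] += A[i, j]
            counts[p, c[j]] += 1
    return sums, counts

def global_sync(sums, counts):
    """Global sync: merge partial statistics into fresh block means,
    which are broadcast to the mappers for the next iteration."""
    return np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)

A = np.array([[1.0, 1.0, 0.0],
              [0.9, 1.1, 0.1],
              [0.0, 0.0, 1.0]])
c = [0, 0, 1]                  # column labels held fixed in this pass
means = np.array([[1.0, 0.0],  # initial block means
                  [0.0, 1.0]])
r = [map_assign_row(A[i], c, means) for i in range(3)]
print(r)  # rows 0 and 1 join group 0, row 2 joins group 1 -> [0, 0, 1]
sums, counts = reduce_stats(r, A, c, 2, 2)
means = global_sync(sums, counts)
```

Row passes and column passes then alternate, with a global sync after each pass, until the assignments stop changing.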
EXPERIMENTS
Setup:
39 nodes
Two dual-core processors and 8 GB RAM per node
Linux RHEL4
4 Gbps Ethernet
SATA disks, 65 MB/sec or roughly 500 Mbps
The total capacity of our HDFS cluster was just 2.4 terabytes
HDFS block size was set to 64 MB (the default value)
Java: Sun JDK version 1.6.0_03
EXPERIMENTS (CONT.)
The pre-processing step on the ISS data
Default values:
39 nodes
6 concurrent maps per node
5 reduce tasks
256 MB input split size
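Settings like these are typically supplied through Hadoop's job configuration. A hypothetical configuration fragment using era-appropriate property names (exact keys vary across Hadoop versions, so treat this as a sketch):

```xml
<configuration>
  <!-- 6 concurrent map tasks per node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>
  </property>
  <!-- 5 reduce tasks for the job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>5</value>
  </property>
  <!-- 256 MB minimum input split size, in bytes -->
  <property>
    <name>mapred.min.split.size</name>
    <value>268435456</value>
  </property>
</configuration>
```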
EXPERIMENTS (CONT.)
CONCLUSIONS
Using relatively low-cost components, DisCo achieves I/O rates that exceed those of high-performance storage systems.
Performance scales almost linearly with the number of machines/disks.