DisCo: Distributed Co-clustering with Map-Reduce
DISCO: DISTRIBUTED CO-CLUSTERING WITH MAP-REDUCE
Spiros Papadimitriou, Jimeng Sun
IBM T.J. Watson Research Center, Hawthorne, NY, USA
Reporter: Nai-Hui Ku
OUTLINE
Introduction
Related Work
Distributed Mining Process
Co-clustering Huge Datasets
Experiments
Conclusions
INTRODUCTION
Problems:
Huge datasets
Natural sources of data are in an impure form
Proposed method:
A comprehensive Distributed Co-clustering (DisCo) solution using Hadoop
DisCo is a scalable framework under which various co-clustering algorithms can be implemented
RELATED WORK
Map-Reduce framework:
Employs a distributed storage cluster
Block-addressable storage
A centralized metadata server
Convenient data access and a storage API for Map-Reduce tasks
RELATED WORK
Co-clustering:
Cluster shapes: checkerboard partitions, single bi-cluster, exclusive row and column partitions, overlapping partitions
Optimization criteria: e.g., code length
DISTRIBUTED MINING PROCESS
Identify the source and obtain the data
Transform the raw data into the appropriate format for data analysis
Visualize the results, or turn them into input for other applications
DISTRIBUTED MINING PROCESS (CONT.)
Data pre-processing:
Extracting source/destination IP pairs from a 350 GB raw network event log needed over 5 hours
Much better performance was achieved on a few commodity nodes running Hadoop
Setting up Hadoop required minimal effort
DISTRIBUTED MINING PROCESS (CONT.)
Specifically for co-clustering, there are two main pre-processing tasks: building the graph from the raw data, and pre-computing its transpose.
During co-clustering optimization, we need to iterate over both rows and columns, so we need to pre-compute the adjacency lists of both the original graph and its transpose.
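The transpose can itself be pre-computed in one Map-Reduce pass. Below is a minimal Python sketch (not the paper's actual Hadoop/Java code); a local driver stands in for Hadoop's shuffle phase:

```python
from collections import defaultdict

def map_transpose(src, dst_list):
    """Map: for each outgoing edge (src -> dst), emit (dst, src),
    regrouping edges by destination node."""
    for dst in dst_list:
        yield dst, src

def reduce_transpose(dst, srcs):
    """Reduce: collect all sources pointing at dst, i.e. the adjacency
    list of dst in the transposed graph."""
    return dst, sorted(srcs)

def transpose(adj):
    """Driver simulating the shuffle: group map outputs by key,
    then reduce each group."""
    groups = defaultdict(list)
    for src, dsts in adj.items():
        for key, val in map_transpose(src, dsts):
            groups[key].append(val)
    return dict(reduce_transpose(d, s) for d, s in groups.items())

adj = {"a": ["b", "c"], "b": ["c"]}
print(transpose(adj))  # {'b': ['a'], 'c': ['a', 'b']}
```

On a real cluster the grouping is done by Hadoop's shuffle, so only the map and reduce functions need to be supplied.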
CO-CLUSTERING HUGE DATASETS
Definitions and overview:
Matrices are denoted by boldface capital letters; vectors by boldface lowercase letters
aij: the (i, j)-th element of matrix A
Co-clustering algorithms employ a checkerboard pattern: the original adjacency matrix is partitioned into a grid of sub-matrices
For an m × n matrix, a co-clustering is a pair of row and column labeling vectors
r(i): the group label of the i-th row of the matrix
G: the k × ℓ group matrix
CO-CLUSTERING HUGE DATASETS (CONT.)
The element g_pq gives the sufficient statistics for the (p, q) sub-matrix
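As a concrete illustration, here is a minimal NumPy sketch of computing the group matrix G from the labeling vectors, assuming the sum of entries per sub-matrix is the sufficient statistic (as for binary adjacency data; other objectives use other statistics):

```python
import numpy as np

def group_matrix(A, r, c, k, ell):
    """Compute the k x ell group matrix G, where G[p, q] sums the entries
    of the (p, q) sub-matrix induced by row labels r and column labels c."""
    G = np.zeros((k, ell))
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            G[r[i], c[j]] += A[i, j]
    return G

A = np.array([[1, 0, 1],
              [0, 1, 0]])
r = [0, 1]      # row group labels
c = [0, 0, 1]   # column group labels
print(group_matrix(A, r, c, k=2, ell=2))
# [[1. 1.]
#  [1. 0.]]
```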
CO-CLUSTERING HUGE DATASETS (CONT.) Map function
CO-CLUSTERING HUGE DATASETS (CONT.) Reduce function
CO-CLUSTERING HUGE DATASETS (CONT.) Global sync
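The three slides above only name the map, reduce, and global-sync steps. The following Python sketch shows one row-assignment iteration under an illustrative squared-error cost against block means (the actual cost depends on the chosen co-clustering objective, and this is not the paper's code):

```python
import numpy as np

def map_assign_row(row, c, means):
    """Map: choose the row group p minimizing squared error between the
    row's entries and the current block means (illustrative criterion)."""
    k = means.shape[0]
    costs = [sum((row[j] - means[p, c[j]]) ** 2 for j in range(len(row)))
             for p in range(k)]
    return int(np.argmin(costs))

def reduce_stats(assignments, A, c, k, ell):
    """Reduce: accumulate per-(row group, column group) sums and counts,
    the sufficient statistics for this cost."""
    sums = np.zeros((k, ell))
    counts = np.zeros((k, ell))
    for i, p in enumerate(assignments):
        for j in range(A.shape[1]):
            sums[p, c[j]] += A[i, j]
            counts[p, c[j]] += 1
    return sums, counts

def global_sync(sums, counts):
    """Global sync: merge partial statistics into fresh block means,
    which are broadcast to the mappers for the next iteration."""
    return np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)

A = np.array([[1.0, 1.0, 0.0],
              [0.9, 1.1, 0.1],
              [0.0, 0.0, 1.0]])
c = [0, 0, 1]                  # column labels held fixed in this pass
means = np.array([[1.0, 0.0],  # initial block means
                  [0.0, 1.0]])
r = [map_assign_row(A[i], c, means) for i in range(3)]
print(r)  # rows 0 and 1 join group 0, row 2 joins group 1 -> [0, 0, 1]
sums, counts = reduce_stats(r, A, c, 2, 2)
means = global_sync(sums, counts)
```

Row passes and column passes then alternate, with a global sync after each pass, until the assignments stop changing.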
EXPERIMENTS
Setup:
39 nodes
Two dual-core processors and 8 GB RAM per node
Linux RHEL4
4 Gbps Ethernet
SATA disks, 65 MB/sec or roughly 500 Mbps
The total capacity of our HDFS cluster was just 2.4 terabytes
HDFS block size was set to 64 MB (the default value)
Java: Sun JDK version 1.6.0_03
EXPERIMENTS (CONT.)
The pre-processing step on the ISS data
Default values:
39 nodes
6 concurrent maps per node
5 reduce tasks
256 MB input split size
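Settings like these are typically supplied through Hadoop's job configuration. A hypothetical configuration fragment using era-appropriate property names (exact keys vary across Hadoop versions, so treat this as a sketch):

```xml
<configuration>
  <!-- 6 concurrent map tasks per node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>
  </property>
  <!-- 5 reduce tasks for the job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>5</value>
  </property>
  <!-- 256 MB minimum input split size, in bytes -->
  <property>
    <name>mapred.min.split.size</name>
    <value>268435456</value>
  </property>
</configuration>
```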
EXPERIMENTS (CONT.)
CONCLUSIONS
Using relatively low-cost components, DisCo achieves I/O rates that exceed those of high-performance storage systems.
Performance scales almost linearly with the number of machines/disks.