

KLUWER ACADEMIC PUBLISHERS

DISTRIBUTED AND PARALLEL DATABASES

Volume 11, No. 2, March 2002

Special Issue: Parallel and Distributed Data Mining
Guest Editors: Mohammed J. Zaki and Yi Pan

Introduction: Recent Developments in Parallel and Distributed Data Mining
    Mohammed J. Zaki and Yi Pan    123

Shared State for Distributed Interactive Data Mining Applications
    Srinivasan Parthasarathy and Sandhya Dwarkadas    129

RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets
    Nagiza F. Samatova, George Ostrouchov, Al Geist and Anatoli V. Melechko    157

Parallelizing the Data Cube
    Frank Dehne, Todd Eavis, Susanne Hambrusch and Andrew Rau-Chaplin    181

Boosting Algorithms for Parallel and Distributed Learning
    Aleksandar Lazarevic and Zoran Obradovic    203


Distributed and Parallel Databases is published bi-monthly.

Subscription Rates

The Institutional subscription price for 2002, Volumes 11 and 12 (3 issues per volume), including postage and handling is:

Print OR Electronic Version: EURO 534.00 / US $535.00 per year
Print AND Electronic Version: EURO 640.80 / US $642.00 per year

Ordering Information/Sample Copies

Subscription orders and requests for sample copies should be sent to:

Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, MA 02061, USA; phone: (781) 871-6600; fax: (781) 871-6528; e-mail: [email protected]

or

Kluwer Academic Publishers, P.O. Box 322, 3300 AH Dordrecht, The Netherlands

or to any subscription agent.

© 2002 Kluwer Academic Publishers

No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner.

Photocopying. In the U.S.A.: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, U.S.A.

Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Kluwer Academic Publishers for users registered with the Copyright Clearance Center (CCC). The "services" for users can be found on the internet at: www.copyright.com. For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to Kluwer Academic Publishers, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.

Periodicals postage paid at Rahway, NJ. USPS: 010-328.

U.S. mailing agent: Mercury Airfreight International, Ltd., 365 Blair Road, Avenel, NJ 07001, U.S.A.

Published by Kluwer Academic Publishers

Postmaster: Please send all address corrections to: Distributed and Parallel Databases, c/o Mercury Airfreight International, Ltd., 365 Blair Road, Avenel, NJ 07001, U.S.A.

ISSN: 0926-8782

Printed on acid-free paper


Distributed and Parallel Databases, 11, 123–127, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Introduction: Recent Developments in Parallel and Distributed Data Mining

MOHAMMED J. ZAKI [email protected]
Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

YI PAN [email protected]
Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA

Introduction

Data Mining and Knowledge Discovery in Databases (KDD) is an interdisciplinary field merging ideas from statistics, machine learning, databases, and parallel and distributed computing. It has been engendered by the phenomenal growth of data in all spheres of human endeavor, and the economic and scientific need to extract useful information from the collected data. The key challenge in data mining is the extraction of knowledge and insight from massive databases.

KDD refers to the overall process of discovering new patterns or building models from a given dataset. There are many steps involved in the KDD enterprise which include data selection, data cleaning and preprocessing, data transformation and reduction, data-mining task and algorithm selection, and finally post-processing and interpretation of discovered knowledge. This KDD process tends to be highly iterative and interactive.

Typically data mining has the two high level goals of prediction and description. In prediction, we are interested in building a model that will predict unknown or future values of attributes of interest, based on known values of some attributes in the database. In KDD applications, the description of the data in human-understandable terms is equally if not more important than prediction. Two main forms of data mining can be identified. In verification-driven data mining the user postulates a hypothesis, and the system tries to validate it. The common verification-driven operations include query and reporting, multidimensional analysis or On-Line Analytical Processing (OLAP), and statistical analysis. Discovery-driven mining, on the other hand, automatically extracts new information from data. The typical discovery-driven tasks include association rules, sequential patterns, classification and regression, clustering, similarity search, deviation detection, etc.

While data mining has its roots in the traditional fields of machine learning and statistics, the sheer volume of data today poses the most serious problem. For example, many companies already have data warehouses in the terabyte range (e.g., FedEx, UPS, Walmart). In addition to business oriented data mining, data mining and domain knowledge plays a significant role in knowledge discovery and refinement in engineering, scientific, and medical databases, which are reaching gigantic proportions (e.g., NASA space missions, Human Genome Project) and require both large memory and disk space and high speed computing.


Traditional methods typically made the assumption that the data is memory resident. This assumption is no longer tenable. Implementation of data mining ideas in high-performance parallel and distributed computing environments is thus becoming crucial for ensuring system scalability and interactivity as data continues to grow inexorably in size and complexity.

This special issue of Distributed and Parallel Databases provides a forum for the sharing of original research results and practical development experiences among researchers and application developers from different areas related to parallel and distributed data mining. Papers for this special issue were selected to address data mining methods and processes from both an algorithmic and systems perspective in parallel and distributed environments.

The algorithmic aspects involve the design of efficient, scalable, disk-based, parallel and distributed algorithms for large-scale data mining tasks. The challenge is to develop methods that scale to thousands of attributes and millions of transactions. The techniques of interest span all major classes of data mining methods such as association rules, sequences, classification, clustering, deviation detection, as well as various pre-processing and post-processing operations like sampling, feature selection, data reduction and transformation, rule grouping and pruning, exploratory and interactive browsing, meta-level mining, etc.

The systems issues focus on actual implementation of the algorithms on a variety of parallel hardware platforms, including shared-memory systems (SMPs), distributed-memory systems, networks of workstations, hybrid systems consisting of a cluster of SMPs, geographically distributed systems, etc. The key challenges include improving the load balancing, improving locality, eliminating false sharing on SMPs, minimizing synchronization, minimizing communication, maximizing accuracy of distributed models, integrating heterogeneous sources, and finding appropriate data layouts. Papers dealing with integration of mining with databases and data-warehousing, as well as successful applications, were also sought.

Articles in this special issue

Through rigorous reviews involving 36 referees, four papers were chosen from a pool of 12 papers submitted to this special issue. This is reflected in the high quality of the papers accepted.

In the first paper entitled, "Shared State for Distributed Interactive Data Mining Applications," Parthasarathy and Dwarkadas present and evaluate a distributed interactive data mining system, called InterAct, which supports data sharing efficiently by allowing caching, by communicating only the modified data, and by allowing relaxed coherence requirement specification for reduced communication overhead.

A typical clustering algorithm requires bringing all the data into a centralized warehouse, and involves large transmission cost. In the second paper entitled, "RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets," Samatova et al. present a hierarchical clustering method named RACHET for analyzing multi-dimensional distributed data. RACHET runs with at most linear time, space, and communication costs to build a global hierarchy of comparable clustering quality by merging locally generated clustering hierarchies.


In the paper, "Parallelizing the Data Cube," Dehne et al. present a general methodology for the efficient parallelization of existing data cube construction algorithms. The methods used reduce inter-processor communication overhead by partitioning the load in advance and enable code reuse by permitting the use of existing sequential data cube algorithms for the sub-cube computations on each processor.

Boosting is a popular technique for constructing highly accurate classifier ensembles. In the last paper, "Boosting Algorithms for Parallel and Distributed Learning," Lazarevic and Obradovic present new parallel and distributed boosting algorithms. They have applied their proposed methods to several data sets, and the results show that parallel boosting can achieve the same or even better prediction accuracy than standard sequential boosting, yet is much faster.

In our opinion, the selected papers cover as broad a range of topics as possible within the area of parallel and distributed data mining. We hope the research community finds this special issue of use and interest. We thank all those who submitted papers to this special issue. We are also thankful to Professor Ahmed K. Elmagarmid, the editor-in-chief of DPD, for his guidance and support in integrating this issue. Many thanks also go to the reviewers for their prompt responses and helpful comments.

Resources on parallel and distributed KDD

There is a wealth of resources available for further exploration of parallel and distributed KDD, such as books, journal special issues and special topics workshops, as listed below.

Books on parallel and distributed data mining

1. A. Freitas and S. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic: Boston, MA, 1998.

2. M.J. Zaki and C.-T. Ho (Eds.), Large-Scale Parallel Data Mining, LNAI State-of-the-Art Survey, Vol. 1759, Springer-Verlag: Berlin, 2000.

3. H. Kargupta and P. Chan (Eds.), Advances in Distributed and Parallel Knowledge Discovery, AAAI Press, 2000.

Journal special issues

1. H. Kargupta, J. Ghosh, V. Kumar, and Z. Obradovic (Eds.), "Distributed and Parallel Knowledge Discovery," Knowledge and Information Systems, vol. 3, no. 4, 2001.

2. V. Kumar, S. Ranka, and V. Singh (Eds.), "High Performance Data Mining," Journal of Parallel and Distributed Computing, vol. 61, no. 3, 2001.

3. A. Zomaya, T. El-Ghazawi, and O. Frieder (Eds.), "Parallel and Distributed Computing for Data Mining," IEEE Concurrency, vol. 7, no. 4, 1999.

4. Y. Guo and R. Grossman (Eds.), "Scalable Parallel and Distributed Data Mining," Data Mining and Knowledge Discovery: An International Journal, vol. 3, no. 3, 1999.

5. P. Stolorz and R. Musick (Eds.), "Scalable High-Performance Computing for KDD," Data Mining and Knowledge Discovery: An International Journal, vol. 1, no. 4, 1997.


Workshops

1. 4th IEEE IPDPS International Workshop on Parallel and Distributed Data Mining, 2001. http://www.cs.rpi.edu/~zaki/PDDM01

2. HiPC Special Session on Large-Scale Data Mining, 2000. http://www.cs.rpi.edu/~zaki/LSDM/

3. ACM SIGKDD Workshop on Distributed Data Mining, 2000. http://www.eecs.wsu.edu/~hillol/DKD/dpkd2000.html

4. 3rd IEEE IPDPS Workshop on High Performance Data Mining, 2000. http://www.cs.rpi.edu/~zaki/HPDM/

5. ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999. http://www.cs.rpi.edu/~zaki/WKDD99/

6. ACM SIGKDD Workshop on Distributed Data Mining, 1998. http://www.eecs.wsu.edu/~hillol/DDMWS/papers.html

List of reviewers

Charu C. Aggarwal, Gagan Agrawal, Daniel Barbara, Raj Bhatnagar, Christopher W. Clifton, Ayhan Demiriz, Sanjay Goil, Robert L. Grossman, Himanshu Gupta, Eui-Hong Han, Benjamin C.M. Kao, Hillol Kargupta, George Karypis, Bing Liu, Malik Magdon-Ismail, William A. Maniatty, Ron Musick, Salvatore Orlando, Srinivasan Parthasarathy, Jian Pei, Sakti Pramanik, Andreas Prodromidis, Naren Ramakrishnan, Rajeev Rastogi, Tobias Scheffer, Paola Sebastiani, David B. Skillicorn, Domenico Talia, Kathryn Thornton, Haixun Wang, Jason T.L. Wang, Graham Williams, Xindong Wu, and Osmar R. Zaiane.

Information about guest editors

Mohammed J. Zaki is currently an assistant professor of Computer Science at Rensselaer Polytechnic Institute. He received his M.S. and Ph.D. degrees in Computer Science from the University of Rochester in May 1995 and July 1998, respectively. His research interests include the design of efficient, scalable, and parallel algorithms for various data mining techniques. He is especially interested in developing novel data mining techniques for applications like bioinformatics, web mining, and materials informatics.

Dr. Zaki received a National Science Foundation CAREER Award in 2001 for his work on parallel and distributed data mining; he has published over 50 papers in this area. He is an editor of the book, "Large-Scale Parallel Data Mining," LNAI Vol. 1759, Springer-Verlag, 2000. In the past he has co-chaired several workshops in High-Performance, Parallel and Distributed Data Mining. He is a member of the ACM (SIGKDD, SIGMOD), and IEEE (IEEE Computer Society).

Yi Pan is an associate professor in the Department of Computer Science at Georgia State University. Previously, he was a faculty member in the Department of Computer Science at the University of Dayton. He received his B.Eng. degree in Computer Engineering from Tsinghua University, China, in 1982, and his Ph.D. degree in Computer Science from the University of Pittsburgh, USA, in 1991.


He has published more than 110 research papers. He has received many awards including the Visiting Researcher Support Program Award from the International Information Science Foundation (2001), Outstanding Scholarship Award of the College of Arts and Sciences at the University of Dayton (1999), the Japanese Society for the Promotion of Science Fellowship (1998), AFOSR Summer Faculty Fellowship (1997), NSF Research Opportunity Award (1994, 1996), Andrew Mellon Fellowship from the Mellon Foundation (1990), the best paper award from PDPTA '96 (1996), and Summer Research Fellowship from the Research Council of the University of Dayton (1993). Dr. Pan is currently an area editor-in-chief of the Journal of Information, an associate editor of IEEE Transactions on Systems, Man, and Cybernetics, an editor of the Journal of Parallel and Distributed Computing Practices, an associate editor of the International Journal of Parallel and Distributed Systems and Networks, and on the editorial board of the Journal of Supercomputing. He has also served as a guest editor of special issues for several journals. Dr. Pan is a senior member of IEEE and a member of the IEEE Computer Society.


Distributed and Parallel Databases, 11, 129–155, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Shared State for Distributed Interactive Data Mining Applications*

SRINIVASAN PARTHASARATHY [email protected]
Computer and Information Science, Ohio State University, Columbus, OH 43235, USA

SANDHYA DWARKADAS [email protected]
Computer Science, University of Rochester, Rochester, NY 14627, USA

Recommended by: Mohammed J. Zaki

Abstract. Distributed data mining applications involving user interaction are now feasible due to advances in processor speed and network bandwidth. These applications are traditionally implemented using ad-hoc communication protocols, which are often either cumbersome or inefficient. This paper presents and evaluates a system for sharing state among such interactive distributed data mining applications, developed with the goal of providing both ease of programming and efficiency. Our system, called InterAct, supports data sharing efficiently by allowing caching, by communicating only the modified data, and by allowing relaxed coherence requirement specification for reduced communication overhead, as well as placement of data for improved locality, on a per client and per data structure basis. Additionally, our system supports the ability to supply clients with consistent copies of shared data even while the data is being modified.

We evaluate the performance of the system on a set of data mining applications that perform queries on data structures that summarize information from the databases of interest. We demonstrate that providing a runtime system such as InterAct results in a 10–30 fold improvement in execution time due to shared data caching, the applications' ability to tolerate stale data (client-controlled coherence), and the ability to off-load some of the computation from the server to the client. Performance is improved without requiring complex communication protocols to be built into the application, since the runtime system uses knowledge about application behavior (encoded by specifying coherence requirements) in order to automatically optimize the resources utilized for communication. We also demonstrate that for our benchmark tests, the quality of the results generated is not significantly deteriorated due to the use of more relaxed coherence protocols.

1. Introduction

The explosive growth in data collection techniques and database technology has resulted in large and dynamically growing datasets at many organizations. For these datasets to be useful, data mining, the process of extracting useful information from such datasets, must be performed. The datasets at these organizations are typically in a remote repository accessible via a local or inter-network. Despite advances in processor and network technology, remote data mining is made difficult by the prohibitive bandwidth requirements imposed by the size of the data involved and the low latency requirements imposed by the interactive nature of data mining. The size of the datasets prohibits transferring the entire data to the remote client(s).

*This is an expanded version of our conference paper in the First SIAM International Conference on Data Mining [36].


In addition, data mining is often an iterative process with the user tweaking the supplied parameters according to domain-specific knowledge. This compounds the problem of increased response times due to network and server delays.

These applications can often be structured so that subsequent requests can operate on relatively small summary data structures [37]. Once the summary structure is computed and communicated to the client, interactions can take place on the client without further communication with the server. Also, the mining processes are often independently deployed and perform very different operations, resulting in different traversals of the shared data. The summary is based on the snapshot of the actual data at any point in time. If the data is dynamically being modified, the summary is likely to change. In this scenario, the client's copy of the summary structure must be kept up-to-date.

The above communication can be accomplished by employing some form of message passing or remote procedure call (RPC) in order to keep the data coherent. However, these techniques are often cumbersome, and can be inefficient. RPC mechanisms work well for "function shipping"—moving the process to the data—but they do not work well for moving data to the process. Using message passing requires users to invent ad-hoc communication and coherence protocols in order to manage data copies. Programming ease concerns suggest the need for an abstraction of shared state that is similar in spirit to distributed shared memory (DSM) semantics. However, even the most relaxed DSM coherence model (release consistency [11, 18]) can result in a prohibitively large amount of communication for the type of environment in which data mining may typically be performed. These applications can often accept a significantly more relaxed—and hence less costly—coherence and consistency model, resulting in excellent performance gains. In fact, as we show in this paper, some tasks might require updates to shared data at regular intervals instead of whenever the data is modified, while others might require updates whenever the data is modified "by a certain amount". In other words, such applications can tolerate stale data based on a temporal or change-based criterion, thereby reducing communication overhead and improving efficiency. Hence, overall system performance can be improved by allowing each client to specify the data shared as well as the coherence model required for its needs.

In this paper, we describe our runtime framework, called InterAct [35, 36], developed with such active data mining applications in mind, that allows efficient caching and sharing of data among independently deployed clients and servers. InterAct supports data sharing efficiently by communicating only the modified data, and by allowing individual clients to specify relaxed coherence requirements on a per data structure basis. Additional features of the system include dynamic data placement for memory locality (handling the problem of different traversals of the shared data by different processes and made feasible because of the address-independent nature of the system) and the ability to supply clients with consistent copies of shared data even while the data is being modified (made feasible through the use of virtual memory mechanisms to implement coherence and consistency).

The interface is general enough to support a wide range of application domains including the visualization of scientific simulations and the remote tracking of images. We focus here on several applications from the interactive data mining domain and use them to demonstrate the advantages of the InterAct system. These applications are structured so that the server is responsible for creating the data structure(s) (storing the summary), mapping them to a virtual shared dataspace, and subsequently keeping them up-to-date.


The client can then map the data structure(s) from the virtual shared dataspace under an appropriate coherence model. We show that executing queries using the appropriate summary structure can improve performance significantly; up to a 23-fold improvement in query execution times was observed. When the clients cache the summary structure using relaxed coherence models, we also observed several orders of magnitude reduction in update costs. Furthermore, for these applications, using such relaxed models does not significantly affect the quality of the results (we observed <2% degradation in result quality).

The rest of this paper is organized as follows. In Section 2, we outline our overall system design goals and compare our system to related work in Section 3. In Section 4, we describe the interface and implementation of InterAct, and demonstrate its use through an example application. We evaluate the utility of the system using several applications from the interactive data mining domain, described in Section 5. Experimental results are presented in Section 6. Finally, our conclusions and on-going work are outlined in Section 7.

2. Design goals

In order to accomplish the goal of efficiently providing shared state in a distributed environment, the runtime system must provide an interface that defines a mechanism to declare shared data that is address space independent and persistent (so that clients can join and leave at any time). In addition to the above minimal requirement for sharing, data mining applications have several properties that can be exploited, and key needs that ideally must be supported.

First, since clients have differing needs in terms of how up-to-date a copy of the data is acceptable, the system must identify, define, and support different relaxed coherence models that may be exploited for application performance. This feature of client-controlled coherence is similar to the notion of quasi-caching [4] (see Section 3).

Second, many data mining applications require the capability of obtaining a consistent version of a shared data structure at any time, even in the presence of an on-going update to the data. We refer to this feature as anytime updates.

Third, many data mining applications traverse these summary structures in an ordered manner. Different clients may have different access patterns depending on the kind of queries processed. This feature requires that the system export programmer-controlled primitives that allow data to be remapped or placed in local memory in a manner that mirrors how the data is likely to be accessed by a given user or client. We refer to this feature as client-controlled memory placement.

Fourth, the shared data, although significantly reduced (summary), can still be quite large, so re-sending it on each update can cause significant delays on a busy network. It is therefore important for the runtime system to identify which parts of the shared data have been modified since the client's last update. Only the changes need be sent to the client on an update. In Section 4, we describe how our system, InterAct, can support the above data mining requirements while providing a general interface for a large class of applications.


3. Related work

There is a rich body of literature studying the issues in caching and data sharing in many different computing environments. Distributed object-based systems [6, 7, 17, 25, 26, 30, 45, 51, 53] all support the basic requirement of sharing address-independent objects. However, update propagation in such systems, which is typically supported either by invalidate and resend on access or by RMI-style mechanisms, is inefficient (re-sending a large object or a log of operations (RMI)) and often infeasible (especially if the methods require data available only on the server side) for data mining applications. Distributed shared memory systems [3, 5, 11, 23, 24, 29, 44, 46, 55] all support transparent sharing of data amongst remote processes, with efficient update propagation, but most require tight coupling of processes with sharing that is not address-independent. None of the above systems support flexible client-controlled coherence, client-controlled memory placement (due to their address-dependent nature), or anytime updates.

Quasi-Caching [4] is very relevant to the topic of maintaining client-controlled coherence between a data source and cached copies of the data. Quasi-Caching assumes that each object has a computer that stores the most recent version and other computers store quasi-replicas that may diverge. They considered both time-based and scalar value based divergence (coherence) models. In their work, the allowed divergence is specified by the user in a fixed manner. The quasi-caching work describes when to update a client's cached copy but does not deal with the issue of how to do so efficiently. Furthermore, this work also does not support dynamically modifying the coherence model, or client-controlled memory placement.

Computer Supported Collaborative Work (CSCW) systems [14, 16, 48] share some of the features of InterAct (supporting interactive sessions across independent and potentially heterogeneous systems, update notification, etc.). However, most of these systems are tailored to a specific application, like cooperative engineering design, or distributed meetings. This has led to a proliferation of isolated tools with little or no inter-operability.

There has also been some recent work on distributed data mining systems. The Kensington [21] architecture treats the entire distributed data as one logical entity and computes an overall model from this single logical entity. The architecture relies on standard protocols (such as Java Database Connectivity (JDBC)) to move the data. The Intelliminer [50] and Papyrus systems [19] are designed around data servers, compute servers, and clients, as is the system presented in this work. All of the above systems rely on a message-passing-like interface for programming distributed data mining applications. InterAct provides a shared-object interface with features such as client-controlled coherence, memory placement, and anytime updates. In this paper, we limit ourselves to evaluating InterAct on client-server data mining applications although the InterAct system can be used for a broader class of distributed applications.

4. Runtime framework

Shared data in InterAct are declared as complex-objects. Complex-objects are composed of nodes that may be linked together. Nodes may be C-style structs, basic types, or some predefined InterAct types.


Figure 1. Interactive client-server mining.

Complex-objects could include recursive data structures such as graphs, trees, lists, arrays or collections of nodes. In figure 1, we describe a general-purpose interactive mining algorithm mapped onto InterAct. In this example, the client has mapped three complex-objects, an array, a directed acyclic graph (DAG), and a list representing the shared data summaries, onto the virtual shared dataspace. An element in the list points to the DAG, represented by the connection. The server is responsible for creating and updating these data summaries. The client specifies a coherence model when mapping the summaries, synchronizes when required, and is responsible for the interactive querying component.

In InterAct, every complex-object moves through a series of consistent states, or versions. When a client first maps a shared complex-object, it specifies the desired coherence model. InterAct obtains a copy of the complex-object from the server on the first client access. At the beginning of each semantically meaningful sequence of operations, the client performs a synchronization operation, during which the system ensures that the local copy of the complex-object is "recent enough", as determined by the specified coherence protocol. If not, it obtains a new version from the server.

One of the principal innovations in the runtime system is the provision of the ability to allow each individual process to determine when a cached copy of a complex-object is "recent enough", through the specification of one of a set of highly relaxed coherence models. Changes to complex-objects must be made using mutually exclusive access. As long as applications adhere to these synchronization requirements, InterAct transparently handles all client-server communication including intra-object consistency and coherence maintenance.



InterAct defines the following coherence models. One-Time Coherence specifies a one-time request for data (complex-object) by the client. No history need be maintained. This is the default coherence type. Polled Coherence indicates that the client may request a current version when desired. The server may then attempt to reduce communication requirements by keeping track of the staleness of the data cached by the client. Immediate Coherence guarantees that the client will be notified whenever there are any changes to the mapped data. Diff-based Coherence guarantees that no more than x% of the nodes comprising a complex-object is out of date at the time of synchronization. Delta Coherence [46] guarantees that the complex-object is no more than x versions out of date at the time of synchronization. Temporal Coherence guarantees that the complex-object is no more than x real-time units out of date at the time of synchronization. In all cases, x can be specified by the client. Section 5 details the use of the different coherence models by various data mining techniques.
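The following self-contained C++ sketch illustrates how a client-side runtime might decide, at synchronization time, whether a cached copy still satisfies one of the models listed above. The model names come from the text; the struct layout, the parameter x, and the millisecond clock are assumptions made for the sketch, not InterAct code.

    // Illustrative staleness test for the coherence models described in the text.
    #include <cstdint>

    enum class Coherence { OneTime, Polled, Immediate, DiffBased, Delta, Temporal };

    struct CachedObject {
        Coherence model;
        double    x;                 // client-supplied parameter (%, versions, or time units)
        uint64_t  cached_version;    // version of the local copy
        uint64_t  cached_time_ms;    // time the local copy was last brought up to date
        double    pct_nodes_stale;   // fraction of nodes known to be out of date
    };

    // Returns true if a synchronization (read-lock acquire) must contact the server
    // to refresh the cached copy under the object's coherence model.
    bool needs_refresh(const CachedObject& o, uint64_t server_version, uint64_t now_ms) {
        switch (o.model) {
        case Coherence::OneTime:   return false;                             // fetched once, never refreshed
        case Coherence::Polled:    return server_version > o.cached_version; // refresh on explicit poll
        case Coherence::Immediate: return server_version > o.cached_version; // server also pushes notifications
        case Coherence::DiffBased: return o.pct_nodes_stale * 100.0 > o.x;   // more than x% of nodes stale
        case Coherence::Delta:     return server_version - o.cached_version > o.x; // more than x versions behind
        case Coherence::Temporal:  return now_ms - o.cached_time_ms > o.x;   // more than x time units old
        }
        return true;
    }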

The coherence models are motivated by the ability of a large number of data mining and other interactive applications to tolerate a certain level of data staleness. In all cases, the interface involves the use of synchronization in order to bring data up-to-date, or to make modifications to the data. The application programming interface (API) allows processes to acquire a read or a write lock on shared data. A read lock guarantees that the shared data is up to date subject to the coherence model requested. A write lock always guarantees strict coherence.

In order to provide client-controlled memory placement, our API provides primitives by which each process (clients or servers) can locally remap nodes within a complex-object to improve spatial locality. InterAct transparently handles the remapping as a byproduct of supporting address independence. We next discuss the InterAct API.

4.1. Interface

In figure 1, we describe the current interface available to the user within the two grayed rectangular regions. Our interface is essentially a set of template classes. Node is the template in which user data (using Class User_Data) can be embedded and from which it can be accessed (using access_node). InterAct_Object is the template with predefined functions for creating a complex-object, remapping (remap_object) it in memory in a locality-enhancing manner, adding (add_node) nodes to an InterAct complex-object, and deleting (delete_node) nodes from an InterAct complex-object. In addition, there are various methods that allow one to access the root node (get_root) of an InterAct object, identify (num_child) how many nodes are connected to a given node within a complex-object, and access such children nodes (child).

There are also functions for synchronizing (acquire/release read/write lock1) and modifying the required coherence type (cons_type). The User_Data class is used as a base to define what a node contains. The node may be composed of basic data types and pointers to other complex-objects. The interface requires the user to identify these special pointers (object_ptr, not in figure) to the system. We next describe how this interface can be used to create, map, and manipulate summary structures.
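Since the figures are not reproduced in this transcript, the following declaration-only C++ sketch collects in one place the interface elements named above (Node, User_Data, InterAct_Object, access_node, add_node, delete_node, get_root, num_child, child, remap_object, cons_type, and the lock calls). Every signature, parameter type, and the CoherenceType enumeration is an assumption inferred from the prose, not the actual InterAct header.

    // Declaration-only sketch of the described interface; signatures are guesses.
    enum CoherenceType { ONE_TIME, POLLED, IMMEDIATE, DIFF_BASED, DELTA, TEMPORAL };

    class User_Data {                  // base class for user-defined node contents
    public:
        virtual ~User_Data() {}
    };

    template <class T>
    class Node {                       // wraps one unit of user data inside a complex-object
    public:
        T* access_node();              // access the embedded user data
    };

    template <class T>
    class InterAct_Object {            // a shared complex-object made of linked nodes
    public:
        InterAct_Object(const char* name, CoherenceType sync, double par);
        Node<T>* get_root();                      // root node of the complex-object
        Node<T>* add_node(Node<T>* parent);       // attach a new child node
        void     delete_node(Node<T>* node);
        int      num_child(Node<T>* node);        // number of children of a node
        Node<T>* child(Node<T>* node, int i);     // i-th child of a node
        void     remap_object(int placement);     // locality-enhancing remap (Section 4.7)
        void     cons_type(CoherenceType sync, double par);  // change coherence model
        void     acquire_read_lock();             // bring copy up to date per the model
        void     release_read_lock();
        void     acquire_write_lock();            // strict coherence for modifications
        void     release_write_lock();
    };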


Figure 2. Server-side pseudo code: InterAct method-calls shown in bold.

4.1.1. Example usage

4.1.1.1. Server code. In this code fragment (figure 2) the contents of each node within the complex-object to be created is first declared. This is done by deriving a sub-class called lattice_node from the abstract class User_Data as shown. Each lattice_node contains an itemset id and a count for that itemset. The main program in the server initializes the InterAct framework and formally declares and creates an InterAct complex-object called Lattice (using the above defined lattice_node). The server can then add, update, and delete (not shown) nodes after acquiring the necessary lock on the complex-object.

For expository simplicity both the AddItemset and UpdateItemset pseudo-codes assume that the tree is only one-level deep, i.e. only children of the root node have to be added or updated. The AddItemset procedure first identifies the root of the complex-object using the get_root method, and then adds a child to the root using the add_node method. The data within the child node can then be accessed and added to by using the access_node method. The UpdateItemset procedure first identifies the appropriate child node of the root node that needs to be updated (using the numchild and child methods). It then accesses and updates the corresponding child node using the access_node method.
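Figure 2 itself is not reproduced here. As a rough, self-contained stand-in, the following C++ fragment mimics the structure of the server pseudo-code described above using a plain one-level tree in place of the InterAct Lattice complex-object; the names lattice_node, AddItemset, and UpdateItemset come from the text, while the vector-based tree and the example main are invented for illustration (the real code would go through get_root, add_node, and access_node under a write lock).

    #include <cstdio>
    #include <vector>

    // Stand-in for the lattice_node derived from User_Data in the text:
    // each node holds an itemset id and a support count.
    struct lattice_node { int itemset_id; long count; };

    // Stand-in for the one-level Lattice complex-object: a root with children.
    struct Lattice { std::vector<lattice_node> children; };

    // AddItemset: add a child of the root for a new itemset.
    void AddItemset(Lattice& lat, int itemset_id, long count) {
        lat.children.push_back(lattice_node{itemset_id, count});
    }

    // UpdateItemset: locate the child holding the itemset and bump its count.
    void UpdateItemset(Lattice& lat, int itemset_id, long delta) {
        for (lattice_node& n : lat.children)
            if (n.itemset_id == itemset_id) { n.count += delta; return; }
    }

    int main() {
        Lattice lat;                 // in InterAct this would live in the shared dataspace
        AddItemset(lat, 42, 10);     // and be modified only under a write lock
        UpdateItemset(lat, 42, 5);
        std::printf("itemset %d count %ld\n", lat.children[0].itemset_id, lat.children[0].count);
        return 0;
    }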

4.1.1.2. Client code. Like the server, the client also has to define the lattice_node. After performing the initialization operations the client can map a given object as shown in figure 3. The call to create a new InterAct Object specifies the coherence type (sync) and coherence parameters (par), if any, in addition to specifying the object name.


Figure 3. Client-side pseudo code: InterAct method-calls shown in bold.

This call examines a file/database with information associating each object name with a server (IP address and port number). If the process is determined to be a client, the necessary communication with the server is performed in order to retrieve the initial copy of the object. Once the object is mapped it can be synchronized with the server version by acquiring the appropriate lock. In the pseudo-code shown in figure 3, the client invokes a procedure called Prioritized after acquiring a read lock. After the procedure has been completed the client releases the read lock.

Prioritized is a recursive procedure (described in [1, 33]) that identifies the X (X = 40 in the figure) most frequent associations and displays them. Each call to the procedure involves accessing the children of the current node and evaluating their support counts against the minimum support criteria (using the child and access_node methods). Nodes representing itemsets that meet the minimum support criteria are added to the priority queue (priority is determined by support counts). Then the priority queue is dequeued as many times as the number of associations (X = 40) requested. This procedure uses the child, numchild, and access_node methods to access and read the data contained within the complex-object.
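Figure 3 is likewise not reproduced. The following self-contained C++ stand-in sketches the priority-queue logic of Prioritized over the same plain one-level lattice used in the server sketch; in the actual client this traversal would use the child and access_node methods between an acquire and a release of a read lock on the mapped InterAct object. The data structures and values here are assumptions for illustration.

    #include <cstdio>
    #include <queue>
    #include <vector>

    struct lattice_node { int itemset_id; long count; };
    struct Lattice { std::vector<lattice_node> children; };   // mapped summary (stand-in)

    // Prioritized: return the X most frequent itemsets whose support meets min_support.
    std::vector<lattice_node> Prioritized(const Lattice& lat, long min_support, int X) {
        auto by_count = [](const lattice_node& a, const lattice_node& b) { return a.count < b.count; };
        std::priority_queue<lattice_node, std::vector<lattice_node>, decltype(by_count)> pq(by_count);
        for (const lattice_node& n : lat.children)      // child/access_node traversal in InterAct
            if (n.count >= min_support) pq.push(n);     // enqueue nodes meeting minimum support
        std::vector<lattice_node> top;
        for (int i = 0; i < X && !pq.empty(); ++i) {    // dequeue X times (X = 40 in the figure)
            top.push_back(pq.top());
            pq.pop();
        }
        return top;
    }

    int main() {
        Lattice lat{{{1, 120}, {2, 75}, {3, 300}, {4, 40}}};
        for (const lattice_node& n : Prioritized(lat, 50, 2))
            std::printf("itemset %d support %ld\n", n.itemset_id, n.count);
        return 0;
    }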

4.2. Memory management and access

Like any distributed object-based system, our interface identifies any pointers and their associated types to the runtime system in order to provide address independence. During complex-object creation, the complex-object's internal representation is divided into data and connection pages2 (this division is transparent to the user as long as the defined templates are used to declare and access shared data).



Data pages contain all the nodes for a complex-object. Nodes are created and allocated as fixed-size structures so that array index arithmetic3 may be used to access them efficiently (allowing variable sized nodes would simply involve a slightly less efficient access mechanism to the node). Separating connection information for a node (number of nodes linked to) allows the number of links to be variable and enables us to use array arithmetic on the data pages. Each node contains a single pointer into the connection pages that identifies the set of nodes that node links to. Information in the connection pages identifies the nodes within the complex-object in terms of an index, making address independence feasible. Separating connection and data information also co-locates all the pointers, enabling the runtime system to perform efficient pointer swizzling [54].
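The C++ sketch below illustrates the data-page/connection-page split described above: fixed-size node records addressed by index arithmetic, with variable-length child lists stored separately and expressed as node indices. The field names, sizes, and the example object are assumptions for the sketch, not InterAct's actual layout.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct NodeRecord {
        long        user_data;     // fixed-size slot standing in for the embedded user data
        std::size_t conn_offset;   // single "pointer" (an offset) into the connection pages
        std::size_t conn_count;    // number of outgoing links
    };

    struct ComplexObject {
        std::vector<NodeRecord>  data_pages;   // node i lives at data_pages[i]
        std::vector<std::size_t> conn_pages;   // flattened child-index lists
    };

    // Address-independent child access: every reference is an index, so the object
    // can be laid out at a different address on the client side.
    std::size_t child_index(const ComplexObject& o, std::size_t node, std::size_t k) {
        return o.conn_pages[o.data_pages[node].conn_offset + k];
    }

    int main() {
        ComplexObject o;
        o.data_pages = {{10, 0, 2}, {20, 0, 0}, {30, 0, 0}};   // root node 0 with children 1 and 2
        o.conn_pages = {1, 2};
        std::printf("second child of root: node %zu (data %ld)\n",
                    child_index(o, 0, 1), o.data_pages[child_index(o, 0, 1)].user_data);
        return 0;
    }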

Laying the data out in semi-contiguous order facilitates efficient address to node mappings as long as node sizes are fixed, resulting in fast identification of changes to a complex-object as outlined below in Section 4.3. The use of node identifiers coupled with the above scheme also enables fast node to address mapping that permits us to update node changes rapidly as well as maintain mappings independent of server mappings, as outlined below in Section 4.7.

4.3. Object modification detection

The technique we use to detect modifications is similar to that used by multiple-writer page-based software distributed shared memory systems [5, 8], except that we use a node as the granularity at which we detect modifications. At the start of every acquire of a write lock (see figure 4), all relevant complex-object pages are marked read-only using the mprotect virtual memory system call. When a processor incurs a write fault, it creates a write notice (WN) for the faulting page and appends the WN to a list of WNs associated with the current interval, or region encapsulated by an acquire and a release. It simultaneously saves a pristine copy of each page, called a twin, and enables write permissions [8]. When the lock is released, the twin is used to identify the nodes modified within the interval.
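Below is a minimal, self-contained C++ sketch (POSIX, Linux-style signal semantics, a single page) of the write-detection mechanism described above: the page is write-protected at the write-lock acquire, the first write faults, and the handler saves a pristine twin, records a write notice, and re-enables write access. Error handling and async-signal-safety details are glossed over; this is a demonstration of the technique, not InterAct's code.

    #include <csignal>
    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>
    #include <unistd.h>

    static char* g_page = nullptr;      // the protected "data page"
    static char* g_twin = nullptr;      // pristine copy taken at the first write
    static long  g_page_size = 0;
    static volatile sig_atomic_t g_write_notice = 0;   // WN for this page/interval

    static void fault_handler(int, siginfo_t* si, void*) {
        (void)si;  // in a real system si->si_addr maps to a <page, object> pair
        std::memcpy(g_twin, g_page, g_page_size);               // save the twin
        mprotect(g_page, g_page_size, PROT_READ | PROT_WRITE);  // re-enable writes
        g_write_notice = 1;                                     // append a write notice
    }

    int main() {
        g_page_size = sysconf(_SC_PAGESIZE);
        g_page = static_cast<char*>(mmap(nullptr, g_page_size, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        g_twin = new char[g_page_size];

        struct sigaction sa;
        std::memset(&sa, 0, sizeof(sa));
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = fault_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, nullptr);

        g_page[0] = 'a';                              // node contents before the interval
        mprotect(g_page, g_page_size, PROT_READ);     // acquire(write lock): protect pages
        g_page[0] = 'b';                              // first write faults: twin + write notice

        // release(write lock): compare the page against its twin at node granularity.
        std::printf("write notice: %d, node changed: %d\n",
                    static_cast<int>(g_write_notice), g_page[0] != g_twin[0]);
        munmap(g_page, g_page_size);
        delete[] g_twin;
        return 0;
    }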

At the release, all objects that have been modified (identified through the WN list) increment their associated object timestamp (or version number). These objects are efficiently identified since our WN list is maintained as a hash table containing the <page address, object identifier> pairs. Since we ensure that complex-objects reside on separate pages, a write to a page corresponds to a write to a single complex-object. Modified nodes are identified by comparing the modified pages to their twins. Comparison is thus limited only to those pages that are actually modified. These modified nodes are then communicated to the object manager along with the latest version number in order to keep the manager's copy up-to-date. The object manager has a timestamp (or version) map associated with each object. A timestamp map contains an entry for each node indicating the last time it was modified (see figure 4). Upon receiving modifications, the manager updates its copy of the data as well as of the timestamp map.


Figure 4. Efficient shared data updates.

4.4. Updating an object

When asked for changes by the caching process, the object manager (server) compares its timestamp map for the complex-object against the last time the client has been apprised of an update. The result of the timestamp comparison is a run-length encoding of the node data and node connections that have been modified, which constitutes the diff of the complex-object (figure 4). Header information also specifies the complex-object identifier, and the number of node data and connection updates.
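A short C++ sketch of the timestamp comparison just described: the manager keeps a per-node timestamp and, when a client asks for changes, emits a run-length encoded list of the node ranges modified after the client's last update. The encoding format and names here are assumptions for illustration.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Run { uint32_t first_node; uint32_t count; };   // a contiguous range of modified nodes

    // Build the diff body: runs of node indices whose timestamp exceeds the last
    // timestamp the client has seen.
    std::vector<Run> build_diff(const std::vector<uint64_t>& timestamp_map,
                                uint64_t client_last_seen) {
        std::vector<Run> runs;
        for (std::size_t i = 0; i < timestamp_map.size(); ++i) {
            if (timestamp_map[i] <= client_last_seen) continue;       // node unchanged
            if (!runs.empty() && runs.back().first_node + runs.back().count == i)
                ++runs.back().count;                                  // extend the current run
            else
                runs.push_back(Run{static_cast<uint32_t>(i), 1});     // start a new run
        }
        return runs;
    }

    int main() {
        std::vector<uint64_t> ts = {3, 7, 7, 2, 9, 9, 9, 1};  // last-modified version per node
        for (const Run& r : build_diff(ts, 5))                // client has seen version 5
            std::printf("nodes %u..%u modified\n", r.first_node, r.first_node + r.count - 1);
        return 0;
    }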

On the client side, we maintain information corresponding to the objects that are mapped and where they are stored in local memory. On receiving a diff message, we update the corresponding object by decoding the header to determine the object identifier. This object identifier is then used to determine the local location of the object. Data and connection information for nodes within the object are similarly address independent.

The issue of containment (or aggregation) in an object-oriented design is an important one for performance and correctness reasons. In our current implementation, when a client maps a particular object, we update all nodes that belong to the particular object immediately. However, if a node within this object points to another object or another node within another object, that object is not copied immediately. It is copied lazily upon client access or client request. The InterAct interface implicitly gives the programmer control over what needs to be copied immediately and what can be lazily copied. All the programmer has to do is to create separate complex-objects in this case.

4.5. Generating anytime updates

If a client request comes in while a write lock on the object is held by the server or managing process, one approach would be to wait4 till the server transaction commits before sending an update to the client. This may not be acceptable for applications requiring a quick response, especially when the lock is held for a long time. Our approach is to twin the entire object on creation. When the write lock is released, the object twin is updated using the diff mechanism described above. If a client request comes in during a write lock, the system returns the update from the twin rather than the original object. During the application of the diff on the twin, the system returns the update from the original object.


This ensures that the client rarely has to wait for the requested data in practice. This approach is costly in terms of space. An alternative approach which we considered is to twin only the pages modified (as described in Section 4.3) during the write lock and deliver updates for modified pages from these twins. However, this approach involves pointer tracking and associated synchronization and therefore is more costly in terms of time. One could potentially switch between the two approaches as a function of application requirements and system configuration.

4.6. Coherence maintenance

The runtime system optimizes data communication by using the coherence model specified by the user to determine when communication is required. The goal is to reduce messaging overhead and allow the overlap of computation and communication. Implementation of the Immediate Coherence, Polled Coherence, and One-Time Coherence guarantees is fairly straightforward. One-Time Coherence does not require any meta-data (timestamp) maintenance, nor does it require the server to generate object diffs. Polled Coherence is implemented by having the client send the most recent timestamp of the object it has seen. Under Immediate Coherence the server keeps track of this information. Under both Polled and Immediate coherence only those nodes with timestamps greater than this value are communicated. The difference between Polled and Immediate coherence is that in the latter the server notifies the client when a complex-object has been modified while the former has no such notification protocol. This allows the runtime system under Immediate coherence to check for notification messages before issuing an update request to the manager (server), eliminating some communication traffic. The upside to Polled Coherence is that the server does not need to maintain client-specific state on a per complex-object basis.

Temporal Coherence is supported by having the runtime system on the client's side poll for updates every x time units, as defined by the user. To keep track of Diff-based Coherence, the server maintains a cumulative count of nodes per complex-object that are modified since the last client update. If this cumulative count exceeds a preset user-defined value (referred to as the diff parameter), the client is sent a notification (similar to Immediate Coherence) and a subsequent update. Delta Coherence is kept track of in a manner similar to Diff-based Coherence. In these cases as well, the server has to maintain the last timestamp seen by the respective clients.
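The sketch below illustrates the server-side bookkeeping for Diff-based Coherence as just described: a cumulative count of modified nodes is kept per client and complex-object, and a notification goes out once it exceeds the client's diff parameter. The names, the callback, and the example numbers are assumptions for the sketch.

    #include <cstdio>
    #include <functional>

    struct ClientState {
        double diff_parameter;                 // user-defined threshold (modified nodes)
        long   modified_since_update = 0;      // cumulative count since the last client update
        long   last_timestamp_seen   = 0;      // last version the client has been sent
    };

    // Called on each write-lock release with the number of nodes modified in the
    // interval and the new object version.
    void on_object_modified(ClientState& c, long nodes_modified, long new_version,
                            const std::function<void(long)>& notify_client) {
        c.modified_since_update += nodes_modified;
        if (c.modified_since_update > c.diff_parameter) {
            notify_client(new_version);        // notification similar to Immediate Coherence
            c.modified_since_update = 0;       // the client will pull a diff and catch up
            c.last_timestamp_seen = new_version;
        }
    }

    int main() {
        ClientState c{100.0};
        for (long v = 1; v <= 5; ++v)
            on_object_modified(c, 30, v,
                               [](long ver) { std::printf("notify client: version %ld\n", ver); });
        return 0;
    }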

4.7. Memory placement

Different clients may have different mining agendas, leading to different data structure access. InterAct permits the clients to place the mapped data structure in memory in a locality enhancing manner by using the remap( )5 function. For example, if the structure is a tree and most of the client interactions are going to induce a breadth-first evaluation of the tree, then the tree can be placed in memory in a breadth-first fashion to improve cache locality. InterAct currently supports breadth-first, depth-first, and user-defined placement [40]. User-defined placement allows the programmer to define a condition that splits the nodes in an object into separate sets of contiguous memory.
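As a small illustration of what a breadth-first placement pass like remap() might compute, the self-contained C++ sketch below produces the order in which the nodes of a tree-shaped complex-object would be laid out contiguously so that breadth-first traversals touch memory sequentially. The adjacency representation and example tree are assumptions for the sketch.

    #include <cstdio>
    #include <queue>
    #include <vector>

    // Returns the placement order: order[i] is the old node index laid out at slot i.
    std::vector<int> breadth_first_order(const std::vector<std::vector<int>>& children, int root) {
        std::vector<int> order;
        std::queue<int> q;
        q.push(root);
        while (!q.empty()) {
            int n = q.front(); q.pop();
            order.push_back(n);                    // node n gets the next contiguous slot
            for (int c : children[n]) q.push(c);
        }
        return order;
    }

    int main() {
        // Root 0 with children 1 and 2; node 1 has children 3 and 4.
        std::vector<std::vector<int>> children = {{1, 2}, {3, 4}, {}, {}, {}};
        for (int n : breadth_first_order(children, 0)) std::printf("%d ", n);
        std::printf("\n");   // prints: 0 1 2 3 4
        return 0;
    }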


5. Applications

In past work [33, 38, 39, 41], we have shown that it is possible to design useful summary structures for several mining applications so that subsequent queries can operate on these summary structures rather than the actual data. Within our framework of remote mining, these summary structures are generated and kept up-to-date by the data server, and subsequently mapped and operated on by the client. In this work, we simulate the updates on the server side according to real data and application update properties, described below.

5.1. Association and sequence mining

Given a database of transactions where each transaction consists of a set of items, association discovery [2] finds all the item sets that frequently occur together, and also the rules among them. Sequence discovery essentially involves association discovery over temporal databases. It aims to discover sets of events that commonly occur over a period of time.

For association and sequence mining, the summary structure we use is the Itemset Lattice [1, 33] (or Sequence Lattice [41]), which contains pre-mined patterns and the corresponding support6 information. Responses to user requests typically involve computing a constraint-based subset of the entries in the lattice. Updates to the lattice are handled as described in [41, 52]. Each incremental update, reflecting new data, typically combines multiple actual transactions, for performance reasons [33, 52]. There are two possible kinds of updates to the summary structure based on the type of mining being performed. When mining is performed on the entire database, new transactions are usually only added to the database—we call these additive updates. When mining is performed on a window of transactions in the database, changes in the database result in almost as many additions as deletions to the window—we call these windowed updates. Both types of updates typically result in changes to anywhere from 0.1% to 10% of the summary data structure. This is because in typical scenarios the number of transactions being added or deleted is a small percentage (0.1%–1%) of the total number of transactions being represented (and in the case of sequence mining not all customers are part of each update), so the net impact on the summary is relatively low. Additive updates mostly result in modifications to support counts and a few pattern additions and deletions. However, for windowed updates, the changes tend to result in more associations being added and/or deleted. For these applications, since the user is usually interested in keeping track of less frequently occurring associations or sequences, a stricter coherence model such as Polled Coherence or Immediate Coherence is generally preferred.

5.2. Discretization

Discretization has typically been thought of as the partitioning of the space induced by one (say X) or more continuous attributes (base) into regions (e.g., X < 5, X ≥ 5), which highlights the behavior of a related discrete attribute (goal). Discretization has been used for classification in the decision tree context and also for summarization in situations where one needs to transform a continuous attribute space into a discrete one with minimum “loss”.


Our program is an instance of 2-dimensional discretization (two base attributes, described in [39]).

Interactions supported include generating an optimal discretization (based on entropy or classification error), and modifying the location and number of control points (which partition the two-dimensional base attribute space). The summary structure required to support such interactions efficiently is the joint probability density function (pdf) of the base and goal attributes. This pdf is estimated at discrete locations. While several techniques exist to estimate the density of an unknown pdf, the most popular ones are histogram, moving window, and kernel estimates [13]. We use the histogram estimate described in [13]. The advantage of this estimate is that it can be incrementally maintained in a trivial manner (a histogram estimate is essentially the frequency distribution normalized to one). Moreover, the more complicated kernel estimates can easily be derived from this basic estimate [13]. Each update corresponds to one transaction and every update modifies exactly one entry in the array. Each update is simply a small perturbation on the pdf estimate and as such does not affect the quality of discretization significantly. Thus, this technique would benefit from using Diff-based Coherence without affecting the quality of the results. The diff parameter specifies the amount of data that needs to change before it is significant to the application.
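
A minimal sketch of such a histogram estimate is shown below, assuming a fixed grid over the two base attributes and a small number of goal classes; the class and member names are hypothetical and the estimator of [13, 39] is not reproduced in detail. Each incoming transaction increments exactly one cell, which is what makes the estimate trivially incremental:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Histogram estimate of the joint pdf of two base attributes and a goal
    // attribute: counts over a fixed grid, kept per goal class. One transaction
    // updates exactly one cell; normalizing by the total yields the pdf value.
    class HistogramPDF {
    public:
        HistogramPDF(int bins_x, int bins_y, int classes,
                     double x_min, double x_max, double y_min, double y_max)
            : bx_(bins_x), by_(bins_y),
              x_min_(x_min), x_max_(x_max), y_min_(y_min), y_max_(y_max),
              counts_(static_cast<std::size_t>(bins_x) * bins_y * classes, 0),
              total_(0) {}

        // One incoming transaction modifies exactly one entry in the array.
        void add(double x, double y, int goal_class) {
            counts_[index(bin(x, x_min_, x_max_, bx_),
                          bin(y, y_min_, y_max_, by_), goal_class)] += 1;
            total_ += 1;
        }

        // Normalized estimate of the joint pdf at a grid cell.
        double pdf(int ix, int iy, int goal_class) const {
            if (total_ == 0) return 0.0;
            return static_cast<double>(counts_[index(ix, iy, goal_class)]) / total_;
        }

    private:
        static int bin(double v, double lo, double hi, int bins) {
            if (v <= lo) return 0;
            if (v >= hi) return bins - 1;
            return static_cast<int>((v - lo) / (hi - lo) * bins);
        }
        std::size_t index(int ix, int iy, int c) const {
            return (static_cast<std::size_t>(c) * by_ + iy) * bx_ + ix;
        }

        int bx_, by_;
        double x_min_, x_max_, y_min_, y_max_;
        std::vector<long> counts_;
        long total_;
    };

    int main() {
        HistogramPDF h(10, 10, 2, 0.0, 1.0, 0.0, 1.0);
        h.add(0.25, 0.75, 1);   // a single server-side update
        std::printf("pdf at cell (2, 7), class 1: %f\n", h.pdf(2, 7, 1));
        return 0;
    }

Because the counts are the raw frequency distribution, a kernel estimate can later be derived from the same counts if a smoother density is needed.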

5.3. Similarity discovery in datasets

This application computes and maintains the similarity between two or more datasets. Such measures of similarity are useful for clustering homogeneous datasets. In [38], we define the similarity between two datasets to be a function of the difference between the set of associations induced by them, weighted by the supports of each association. To compute and maintain the similarity between n datasets, the client maps the itemset lattices (containing the association patterns and their supports) from each of the distinct data sources and then computes the pairwise similarity measures.

It has been noted [12] that incorporating domain bias in the similarity measure via suitable interactions can be very useful. In this application, we support the following operations: similarity matrix re-generation after a data structure update, identifying influential attributes, and constraining the similarity probe set via constraint queries on the itemset lattices. Incrementally maintaining the association lattices has already been discussed above. However, since this application is more interested in general patterns, even if a large percentage of the mapped summary structure is modified over a period of several updates, it has been shown that the percentage change in the measured similarity is not significantly affected. The measured similarity correlates more directly with the magnitude of the change in the data. This magnitude is not directly measurable without a large amount of overhead. However, the use of Delta Coherence captures this application’s requirements by allowing the data to be several versions out of date without affecting result quality.
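
The sketch below gives one plausible instantiation of such a lattice-based similarity, assuming each mapped lattice is reduced to a map from (serialized) pattern to support; it follows the spirit of [38], in that patterns common to both datasets contribute according to how close their supports are, but it is not necessarily the exact measure defined there:

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <string>

    // Each dataset is summarized by its itemset lattice, reduced here to a map
    // from a serialized pattern to its support (a fraction in [0, 1]).
    using Summary = std::map<std::string, double>;

    // Patterns present in both datasets contribute according to how close their
    // supports are; the total is normalized by the size of the union of the two
    // pattern sets. This is an illustrative stand-in for the measure of [38].
    double similarity(const Summary& a, const Summary& b) {
        std::size_t common = 0;
        double score = 0.0;
        for (const auto& entry : a) {
            const auto it = b.find(entry.first);
            if (it != b.end()) {
                ++common;
                score += 1.0 - std::fabs(entry.second - it->second);  // support gap
            }
        }
        const std::size_t union_size = a.size() + b.size() - common;
        return union_size == 0 ? 1.0 : score / static_cast<double>(union_size);
    }

    int main() {
        Summary d1 = {{"bread", 0.40}, {"bread milk", 0.25}, {"beer", 0.10}};
        Summary d2 = {{"bread", 0.38}, {"bread milk", 0.20}, {"diapers", 0.15}};
        std::printf("similarity = %.3f\n", similarity(d1, d2));
        return 0;
    }

Computing the pairwise similarity matrix across n mapped lattices is then a double loop over this function.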

5.4. Other applications

We have described how one can define summary structures for four key data mining tasks. The same principles may be applicable to other data mining tasks as well, and this issue is under investigation. For instance, for clustering one may choose to create, maintain (server-side), and map (client-side) cluster descriptors. For decision-tree classifiers one could either create, maintain, and map the actual trees themselves or a compact representation of the tree such as an FFT representation [27]. For both neural networks and Bayesian networks the actual network descriptors (weights and links) could be mapped.

Table 1. Server update properties.

    Application          Object size   UPS   %Change   Change type
    Association mining   3.3 MB        0.5   4%        ADD/DEL
    Sequence mining      1.0 MB        1     10%       MOD/ADD
    Discretization       0.5 MB        100   0.002%    MOD
    Similarity(1)        0.5 MB        5     10%       MOD/ADD
    Similarity(3)        3 × 0.5 MB    5     0.33%     MOD/ADD

5.5. Properties of server updates

We outline the exact nature of the updates7 used for our experiments for each application in Table 1. The second column refers to the size of the summary structure when the server starts up. UPS corresponds to updates per second on the server side. The column labeled %Change corresponds to the percentage of nodes in the summary structure that are changed over a single update; these values are representative of realistic workloads for each of the applications. For Similarity Discovery, we computed the similarity among four databases. Updates on one of these databases (Similarity(1)) had different properties from the other three (Similarity(3)). The last column in the table refers to the dominant change type of the given update. ADD refers to the fact that the update adds new nodes, DEL refers to the fact that the update deletes nodes, and MOD refers to the fact that the update modifies existing nodes. The change types in column five of the table are listed in decreasing order of dominance. For example, ADD/DEL refers to the fact that on average, executing the corresponding update results in more additions than deletions to the summary structure.

6. Experimental evaluation

We evaluate our framework in a distributed environment consisting of SUN workstations connected by 10 or 100 Mbps switched Ethernet. Unless otherwise stated, the clients use a 100 Mbps link, and are 270 MHz UltraSparc IIi machines. The clients in each application interact with the server by sharing the summary data structures with the server. The server creates the summary data structure and updates it corresponding to changes in the database (which we emulate). The client maps these data structures using one of the provided set of coherence models. Updates are then transmitted to the client according to the coherence model chosen.


6.1. Runtime overhead

We evaluated the overhead imposed by our system on the server side during normal execution without client connections. To evaluate this overhead, we compared the test application written using the InterAct interface against a program written using standard C++. Using UNIX malloc with the C++ program, we found that the application running on top of our framework executed slightly faster due to our improved memory allocation policy. We use a special memory placement library [40] in our system that improves locality and therefore cache performance. Comparing against the standard C++ program with calls to the UNIX malloc replaced by calls to our memory placement library, we found the runtime overhead imposed by using our template interface (which involves more indirect accesses) to be less than 5% for our application suite.

6.2. Client-side caching

In typical client-server applications, the client makes a request to the server, the server computes the result, and then sends the result back to the client. Since the interactions in our applications are often iterative in nature, caching the data structure on the client side so that repeated accesses may be performed locally eliminates overhead and delays due to network latency and server load. The potential gain from client-side caching depends on a number of factors: the size of the shared data, the speed of the client, the network latency, and the server load.

In this experiment, we ran each of our applications under the following scenarios:

1. Client-Side Caching (CSC): the client caches the summary structure and executes the query on the local copy.

2. Server Ships Results to Client (SSRC): the client queries the server and the server ships the results back to the client. This scenario is similar to the use of an RPC mechanism. In order to better understand the impact of server load, we varied the number of clients serviced by the server from one (SSRC) to eight (Loaded-SSRC). For each of the applications considered, Associations, Sequences, Discretization, and Similarity, the average size of the results shipped by the server was 1.5 MB, 0.25 MB, 0.4 MB and 0.75 MB respectively.

We measured the time to execute each query under both scenarios. We evaluated each scenario on client machines that were either an UltraSparc (143 MHz) machine or an UltraSparc IIi (270 MHz). In each case, our server was an 8-processor 333 MHz UltraSparc IIi machine. Results are presented in Tables 2 and 3 for these scenarios under different network configurations. We varied the network configuration by choosing clients that are connected to the server via a 10 Mbps or a 100 Mbps Ethernet network.

Table 2. Time (in seconds) to execute query: 100 Mbps interconnect.

                      Client(143)                Client(270)
    Application       CSC     SSRC    L-SSRC     CSC     SSRC    L-SSRC
    Association       2.4     1.6     2.5        1.5     1.4     2.3
    Sequence          0.58    0.55    0.86       0.35    0.5     0.73
    Discretization    0.87    0.67    1.08       0.55    0.6     0.98
    Similarity        0.35    0.55    0.98       0.11    0.37    0.94

Table 3. Time (in seconds) to execute query: 10 Mbps interconnect.

                      Client(143)                Client(270)
    Application       CSC     SSRC    L-SSRC     CSC     SSRC    L-SSRC
    Association       2.4     4.05    7.2        1.5     2.5     5.1
    Sequence          0.58    0.85    1.35       0.35    0.63    1.18
    Discretization    0.87    1.35    2.75       0.55    0.94    1.6
    Similarity        0.35    1.5     2.7        0.11    0.9     2.4

The results in Tables 2 and 3 show that client-side caching is beneficial for all but a few of the cases. In particular, the following trends are observed. Client-side caching is more beneficial under the following scenarios: the network bandwidth is low (speedups from client-side caching under the 10 Mbps configuration are larger (1.5 to 23) than the 100 Mbps numbers (0.6 to 9)), the server is loaded (comparing the L-SSRC column (speedups of 1.1 to 9) with the SSRC column (speedups of 0.6 to 3.5) with a 100 Mbps network), the client is a fast machine (comparing the columns involving the 270 MHz client versus the slower clients), or the time to execute the query is low (comparing the row involving similarity discovery with the row involving association mining). In other words, as expected, the benefits from client-side caching are a function of the computation/communication ratio. The lower the ratio, the greater the gain from client-side caching.

These results illustrate that the caching enabled by InterAct is very useful for such applications, especially when deployed on the Internet. The results from this experiment underscore two key aspects. First, it is possible to design useful summary structures that summarize the dataset effectively for mining purposes. Such summary structures can be accessed to answer a set of useful queries rapidly and efficiently, and are significantly smaller than the original datasets they summarize. Second, shipping the summary structure to the client offloads much of the computational work from the server and accelerates query processing by eliminating the client-server network delay.

6.3. Coherence model evaluation

In this section, we evaluate the benefits of using relaxed coherence models. In our experiments, the clients map the shared summary structure, perform iterative requests simulating a realistic data mining interaction, and synchronize with the server periodically. The server concurrently updates the shared data structure, reflecting changes to the actual data.


Figure 5. Coherence model evaluation.

Figure 5 reports the average synchronization time (time for a read acquire_lock, defined in Section 4.1, and used to bring the complex-object up-to-date according to the desired coherence model) for each of the applications under different coherence models. In these experiments, all clients use the same coherence model. We measured the synchronization time over a window of several (35 to be precise8) synchronizations and averaged this time. This average synchronization time in seconds is represented on the Y axis. The X axis corresponds to the number of clients in the experiment.

Tables 4 through 7 present a breakdown of the total number of requests made by clients, and the total amount of data communicated under each of these coherence models for each of the applications. The first column represents the number of clients in the system. The subsequent columns represent the cumulative sum of the data sent out by the server to all the clients and the total number of requests made by all the clients, with each client performing 35 synchronization operations, for each of the coherence models evaluated for the application.

Table 4. Association performance breakdown.

           Polled                       Immediate
    #C     Total data     # Requests    Total data     # Requests
    8      145306 × 32    280           141989 × 32    70
    4      74552 × 32     140           74552 × 32     35
    2      37276 × 32     70            37276 × 32     16
    1      18638 × 32     35            18638 × 32     7

Table 5. Sequence mining performance breakdown.

           Polled                       Immediate
    #C     Total data     # Requests    Total data     # Requests
    8      85414 × 40     280           33172 × 40     140
    4      43402 × 40     140           18299 × 40     53
    2      20410 × 40     70            9044 × 40      34
    1      9744 × 40      35            4668 × 40      14

Table 6. Discretization performance breakdown.

           Polled                       Diff(15)
    #C     Total data     # Requests    Total data     # Requests
    8      2013 × 24      280           1640 × 24      77
    4      973 × 24       140           896 × 24       44
    2      393 × 24       70            374 × 24       18
    1      207 × 24       35            178 × 24       8

Table 7. Similarity performance breakdown.

           Polled                   Diff(100)                Delta(15)
    #C     Total data     # Req     Total data     # Req     Total data     # Req
    8      111415 × 32    1120      33792 × 32     69        18448 × 32     34
    4      52213 × 32     560       15841 × 32     31        8920 × 32      17
    2      25598 × 32     280       9976 × 32      17        4224 × 32      8
    1      12840 × 32     140       4224 × 32      9         2112 × 32      4

The average synchronization time can be broken down into two components, the communication overhead and the time spent waiting at the server. Due to limited resources, we could evaluate our work only on up to eight clients. Since the server is multi-threaded and has up to eight processors, we see very little increase in server load overheads with an increase in the number of clients. However, as the number of clients serviced by the server increases, the average synchronization times increase due to contention for network resources, as well as due to the fact that the server modification window is also consequently higher. This underscores the need to reduce communication and server load.

In order to evaluate the effect of the coherence model on performance, we begin by evaluating the effectiveness of using diffs (using Polled Coherence) as opposed to resending the entire complex-object (using One-Time Coherence), the strawman approach taken by most existing commercial object-oriented systems.9 We found that sending diffs could be 10–30 times faster than resending the entire complex-object for reasonable-sized (10% of nodes) changes made by the server. The size of the messages used to update the clients is much smaller, resulting in reduced communication overhead. The gains due to reduction in communication cost also reflect the reduction in overhead from client-server flow control due to finite buffer sizes.
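
The mechanism can be illustrated in the style of software-DSM diffing: compare the current contents of a page against a twin saved when the page was last sent, and encode only the words that changed. InterAct's actual encoding and messaging are not spelled out in the paper, so the types and functions below are illustrative only:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One changed word of a page: its offset and its new contents.
    struct DiffEntry {
        std::uint32_t offset;
        std::uint64_t value;
    };

    // Compare the current page against the twin saved at the last send and
    // encode only the words that changed.
    std::vector<DiffEntry> make_diff(const std::uint64_t* twin,
                                     const std::uint64_t* current,
                                     std::size_t words) {
        std::vector<DiffEntry> diff;
        for (std::size_t i = 0; i < words; ++i)
            if (twin[i] != current[i])
                diff.push_back({static_cast<std::uint32_t>(i), current[i]});
        return diff;
    }

    // The client patches its cached copy in place.
    void apply_diff(std::uint64_t* page, const std::vector<DiffEntry>& diff) {
        for (const DiffEntry& e : diff)
            page[e.offset] = e.value;
    }

    int main() {
        std::vector<std::uint64_t> twin(512, 0), server(512, 0), client(512, 0);
        server[7] = 42;     // the server modifies a few words of the page
        server[130] = 99;
        std::vector<DiffEntry> diff = make_diff(twin.data(), server.data(), server.size());
        apply_diff(client.data(), diff);
        std::printf("words shipped: %zu of %zu\n", diff.size(), server.size());
        return 0;
    }

Only the changed words cross the network, so the message size tracks the fraction of the object that actually changed rather than its total size.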

We next evaluate the impact of having the flexibility of using different client-controlled coherence models. For both association mining and sequence mining, we evaluated polled coherence and immediate coherence. Immediate coherence performs better than polled coherence by several factors for these two applications. Polled coherence results in the client sending a request to the server at every synchronization point. Immediate coherence dictates that the server send a notification message to the client. On seeing the notification message at a synchronization point, the client polls the server for the update.

For association mining, the data communicated in both the immediate and polled cases is the same (Table 4). The synchronization rate is a function of the query processing time and the synchronization time. In this application, the query processing time dominates, resulting in the server modification window (the time between requests for updates) being roughly the same under both protocols. Additionally, the server updates in this application primarily involve adding and deleting nodes (see Table 1), explaining the total size of the changes being the same. However, there is a 75% decrease in the number of messages sent out under the former model. The synchronization rate is higher than the rate at which the server modifies data, resulting in unnecessary requests when using polled coherence. This results in a 3-fold performance improvement.

Sequence mining also sees a 3-fold improvement in synchronization time when using immediate coherence as opposed to polled coherence. There is a reduction not only in the number of requests but also in the total data communicated. In this application, the query processing time is small. Hence, when synchronization time is reduced, the server modification window is also smaller. In addition, this application primarily modifies existing nodes (see Table 1). These factors combine to reduce the amount of data. The synchronization rate in this application is fairly close to the rate at which the server modifies data. Hence, due to timing variations in receiving notifications with immediate coherence, some of the synchronization operations remain local and do not request updates, resulting in a 2-fold reduction in the number of requests.
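
A minimal sketch of the client-side difference between the two models at a synchronization point is given below; the flag, the enum, and the messaging stub are hypothetical stand-ins for InterAct's real notification and update machinery:

    #include <atomic>
    #include <cstdio>

    // Stand-ins for the runtime's real machinery (all hypothetical).
    enum class Model { Polled, Immediate };

    std::atomic<bool> notification_pending{false};  // set when the server pushes a notification

    void request_and_apply_update() {
        // Placeholder: fetch the latest changes from the server and patch the
        // locally cached complex-object.
        std::printf("requested update from server\n");
    }

    // Client-side read acquire on the shared complex-object.
    void read_acquire(Model model) {
        switch (model) {
        case Model::Polled:
            // One request per synchronization point, even if nothing changed.
            request_and_apply_update();
            break;
        case Model::Immediate:
            // Contact the server only if it announced a new version.
            if (notification_pending.exchange(false))
                request_and_apply_update();
            break;
        }
    }

    int main() {
        read_acquire(Model::Immediate);   // no notification pending: stays local
        notification_pending = true;      // pretend the server pushed a notification
        read_acquire(Model::Immediate);   // now fetches the update
        read_acquire(Model::Polled);      // always fetches
        return 0;
    }

Under polled coherence every synchronization costs a round trip even when nothing changed, which is exactly the source of the unnecessary requests discussed above.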

6.3.1. Diff-based and delta coherence model evaluation. While association mining and sequence mining benefit from sending updates rather than the entire complex-object, they both require the client to have the latest copy of the shared data. However, discretization and similarity discovery can tolerate some staleness in the interaction structure without losing much accuracy.

Discretization can make use of the diff-based coherence model since we know that the quality of discretization is not affected by small changes in the shared data (we quantify this in the next paragraph). Similarity discovery does not benefit as much from diff-based coherence since every server update is likely to modify roughly 10% of one of the summary structures, or 1600 nodes (see Table 1), which far exceeds the diff parameter that we use. The number in brackets for diff-based coherence in figures 5 and 7 is the diff parameter, or the number of nodes that must be modified to trigger an update to the client (set to 15).

However, similarity discovery benefits more from the delta coherence model due to the fact that the result quality is affected more by the magnitude of the change in data than by the amount of data that has changed. The magnitude of change per server update is small, while the number of nodes changed is about 10%. The number in brackets for delta coherence is the maximum number of server updates (or versions: corresponding to releasing a write lock) between updates to the client. For these applications, we compare the average synchronization times with polled coherence (object resends are always going to be worse than polled coherence).
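
The server-side bookkeeping implied by the two relaxed models can be sketched as follows, assuming a per-client record of how many nodes have changed and how many versions have been released since that client last synchronized; the structure and the thresholds are illustrative, not InterAct's actual implementation:

    #include <cstdio>

    // Server-side view of one client's coherence contract (illustrative only).
    // Diff(d): push an update once at least d nodes changed since the last one.
    // Delta(k): push once the client's copy is more than k versions stale.
    struct ClientContract {
        int  diff_threshold;              // d, e.g. 15 or 100 nodes
        int  delta_versions;              // k, e.g. 15 or 100 versions
        bool use_diff;                    // true: Diff(d), false: Delta(k)
        int  nodes_changed_since_update;  // accumulated node modifications
        int  versions_behind;             // write-lock releases since last push
    };

    // Called after every server write-lock release that modified the object.
    void on_server_update(ClientContract& c, int nodes_modified) {
        c.nodes_changed_since_update += nodes_modified;
        c.versions_behind += 1;
    }

    bool must_push_update(const ClientContract& c) {
        return c.use_diff ? c.nodes_changed_since_update >= c.diff_threshold
                          : c.versions_behind > c.delta_versions;
    }

    int main() {
        ClientContract diff15  = {15, 0, true, 0, 0};   // Diff(15)
        ClientContract delta15 = {0, 15, false, 0, 0};  // Delta(15)
        for (int i = 0; i < 10; ++i) {                  // 10 server updates, 2 nodes each
            on_server_update(diff15, 2);
            on_server_update(delta15, 2);
        }
        std::printf("Diff(15) push: %d, Delta(15) push: %d\n",
                    must_push_update(diff15), must_push_update(delta15));
        return 0;
    }

With ten server updates that each modify two nodes, the Diff(15) contract above fires while the Delta(15) contract does not, which mirrors how the two models react differently to many small changes.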

Referring to figure 5 for discretization, we find that diff-based coherence, when using a diff parameter of 15, was on average twice as efficient as polled coherence. This difference increases as the diff parameter is increased. For a diff parameter of 100 we found that the average synchronization time is 15–20 fold better. Delta coherence is particularly effective for similarity discovery, outperforming polled coherence by three orders of magnitude at low server load.

In both of these applications (referring to Tables 6 and 7) there is a reduction not only in the number of requests but also in the total data communicated. While part of the reason for this behavior is the same as discussed earlier for association and sequence mining, the main cause for this reduction is the applications’ tolerance for staleness as specified by the relaxed coherence models.

6.4. Client-controlled coherence

An important contribution of our system is the fact that different clients may map the same shared structure using different coherence models. As an example, a client interested in similarity discovery involving a particular database could map the association lattice (as described in Section 5) of the database using the delta coherence model. The same lattice may be mapped by another client for the purpose of association mining using the polled coherence model (figure 6).

In this experiment, we considered the following configurations. In the polled configuration, all eight of our clients used polled coherence. In the delta configuration, all eight used delta coherence. In the mixed configuration, four used delta coherence and four used polled coherence.

We found the average synchronization time of clients under polled coherence in the Mixed configuration to be slightly lower due to reduced traffic, and that for clients using delta coherence to be slightly higher due to the extra traffic from the clients using polled coherence. The average synchronization times of clients under delta coherence in the Mixed experiment were two orders of magnitude lower than those for when all the clients mapped the data using polled coherence. If client-controlled coherence were not used, the server would have to adhere to the strictest coherence model for correctness, in this case, polled coherence, resulting in much reduced performance. Thus, by using the coherence model required by each client, the server is able to improve overall performance.


Figure 6. Effect of client-controlled coherence on similarity discovery.

6.5. Interaction quality

The above results indicate that encoding knowledge about application behavior by choosing the appropriate coherence model can improve performance. This improved performance comes about due to reduced resource consumption (server processing time, network utilization). However, to better understand the implications of lower resource usage one must also evaluate the corresponding loss in result quality for the application when using more relaxed coherence models. Such resource-aware computing issues have been studied in the context of other domains such as mobile computing [43] and multimedia computing [31, 32], and we apply a similar analysis below for discretization and similarity discovery.

For each of these two applications, we plotted the result quality under the different coherence protocols over a certain period of time (demarcated by server updates). Figure 7 presents these results. For discretization, the result quality is represented by the classification error, i.e., the fraction of points in the space that are mis-classified. The plot for polled coherence represents the best achievable classification error for the algorithm. The plot for diff-based coherence (with a diff parameter of 100 as opposed to 15) is the error obtained when using a relaxed coherence model. Clearly, for this application the loss in quality is not significant. In fact, it is off by less than 1% of the exact error at all instances in time even for such a high diff parameter. For similarity discovery, the result quality is represented by the similarity value, which represents the distance between two datasets. The similarity value using delta coherence is no worse than 2% off the value obtained using polled coherence at all instances in time. For more details on the network-aware adaptive nature of these two applications, readers are referred to a recent paper [34].

Figure 7. Result quality.

6.6. Effect of modification rate

In order to evaluate the impact of the server modification rate on our choice of coherence model, we modified the similarity discovery experiment described in the previous sections in the following ways. Two of the four lattices that we map have server update characteristics as described in Table 1, Similarity(1). For the other two lattices, we used the server update characteristics described under Similarity(3), where each server update modifies less than 0.1% of the data structure. Each lattice is maintained by a separate server process running on our 8-processor server.

We then evaluated the average synchronization time for one client in the system while varying the server updates per second. We varied diff, the diff parameter or the number of nodes that differ before an update is sent to the client, from 10 to 100 for diff-based coherence (see figure 8). We also varied delta, the number of virtual time intervals between successive updates to the client, from 10 to 100 for delta coherence. The larger the diff/delta values, the lower the average synchronization time. The delta-based approach still does better for this application at a low server update rate since it minimizes the communication with the client. However, at larger update rates (crossover point 230 ups), diff-based coherence (diff num = 100) begins to perform better than delta coherence. The reason for this is that at higher update rates, updates to the client are sent too frequently (in other words, a delta num of 100 is too low) for all the complex-objects. However, since only two of the four lattices have a large percentage modified per update, for a diff value of 100, only these two lattices will cause updates to be sent to the client. The other two lattices are not modified at the same rate, resulting in lower average synchronization times when using diff-based coherence with a diff value of 100.

Figure 8. Effect of transaction rate on similarity performance.

This experiment highlights the influence of server update properties on the choice of a coherence model, as well as the importance of being able to dynamically change coherence requirements as the application behavior changes. To be effective, the choice among relaxed coherence models must reflect not only what the application can tolerate in terms of data staleness, but also how much and how often changes are made to the shared summary structure.

6.7. Locality enhancements

Locality-enhancing memory placement is especially useful when the client uses a predefined traversal of the shared summary structure. We illustrate its benefits using association mining, where we found that for different queries, a different mapping of the data structure presented the best results. For example, when computing the most frequent associations (sometimes referred to as quantified associations), a breadth-first representation of the lattice is most often desired, since the more frequent an association is, the closer it is likely to be to the root of the lattice. One can be more exact by using the user-defined memory placement for this query. When computing inclusive associations, whereby one desires to find all associations involving a particular item, the traversal is typically depth-first. In such cases one may want to use a depth-first memory placement of the lattice. These different association queries are commonly used in online association mining [1, 33]. For these queries, we found up to a 20% improvement in execution times by remapping the data structure according to the best mapping strategy.

7. Conclusions and future directions

We have described a general runtime framework that supports efficient data structure sharing among distributed and interactive components of client-server applications. While the system is general enough to support a wide range of application domains, in this paper we have demonstrated the utility of the system for, and evaluated its performance on, a suite of interactive data mining applications. The runtime interface enables clients to cache relevant shared data locally, resulting in faster (up to an order of magnitude) response times to interactive queries. In the event that this shared data is modified, the complexity of determining exactly what data to communicate among clients and servers, as well as when that data must be communicated, is encapsulated within the runtime system. Each process has the ability to use the dynamically modifiable relaxed coherence mechanisms to encode application-specific knowledge about sharing behavior and requirements on a per data structure basis.


This information can then be used by the runtime to further reduce communication time, potentially by several orders of magnitude. For those applications in our test suite that can take advantage of the more relaxed coherence protocols, we have also shown that the degradation in result quality is less than 2%. In addition, the anytime update and client-controlled memory placement features made possible by the address-independent virtual memory-based design of the runtime system help improve the performance of several data mining applications.

Further refinement of the system involves improved support for heterogeneous platforms as well as integration with tightly-coupled software shared memory systems [9, 42]. The latter integration will provide the ability for remote satellites to interact with computationally intensive components that require tightly-coupled parallel processing. The resulting system, called InterWeave [10], is on-going work and represents a merger and extension of our previous Cashmere [49] and InterAct systems, combining hardware coherence within small multiprocessors, Cashmere-style lazy release consistency within tightly coupled clusters, and InterAct-style version-based coherence for distributed shared segments.

Acknowledgments

Thanks to the anonymous reviewers for their useful feedback, which enabled us to improve the quality of this article. This work was supported in part by NSF grants CDA-9401142, EIA-9972881, CCR-9702466, CCR-9988361, and CCR-9705594; and an external research grant from Compaq.

Notes

1. Note that our implementation of a write lock is relaxed, in the sense that readers are permitted during a lock held in write mode.
2. Each complex-object is mapped to a disjoint set of pages, which enables our system to transparently detect changes to objects using virtual memory hooks. Since we are dealing with applications where the complex-objects are reasonably large relative to the size of a page, this does not result in memory wastage.
3. Note that since complex-objects can dynamically change in size, all the pages for a complex-object need not be contiguous, so a slightly modified form of array indexing is needed.
4. In order to guarantee an atomic update, the data cannot be sent in an as-is condition, as partial changes to the object may have occurred concurrently.
5. Clients need to execute this only once. Subsequent updates from the server are automatically handled correctly by our system’s address translation mechanisms.
6. Support is the number of times the pattern occurs in the dataset.
7. The datasets used are described in [37].
8. Going to larger window sizes did not affect the average synchronization times for our workloads.
9. Object techniques involving re-executing methods on all cached copies are not possible on data mining applications, since updates use large amounts of I/O, with the data residing only on the server. Hence, these approaches are not suitable for such applications.

References

1. C. Aggarwal and P. Yu, “Online generation of association rules,” in IEEE International Conference on Data Engineering, Feb. 1998.
2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo, “Fast discovery of association rules,” in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), MIT Press: Cambridge, MA, 1996.
3. S. Ahuja, N. Carreiro, D. Gelernter, and V. Krishnaswamy, “Matching language and hardware for parallel computation in the Linda machine,” IEEE Transactions on Computers, vol. 37, no. 8, pp. 896–908, 1988.
4. R. Alonso, D. Barbara, and H. Garcia-Molina, “Data caching issues in an information retrieval system,” ACM TODS, vol. 15, no. 3, pp. 359–384, 1990.
5. C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel, “TreadMarks: Shared memory computing on networks of workstations,” IEEE Computer, vol. 29, no. 2, pp. 18–28, 1996.
6. H.E. Bal, M.F. Kaashoek, and A.S. Tanenbaum, “Orca: A language for parallel programming of distributed systems,” IEEE Transactions on Software Engineering, pp. 190–205, June 1992.
7. M. Carey, D. DeWitt, J. Naughton, M. Solomon et al., “Shoring Up Persistent Applications,” in Proc. of the 1994 ACM SIGMOD Conference, 1994.
8. J.B. Carter, J.K. Bennett, and W. Zwaenepoel, “Implementation and performance of Munin,” in Proceedings of the 13th ACM Symposium on Operating Systems Principles, Oct. 1991, pp. 152–164.
9. D. Chen, S. Dwarkadas, S. Parthasarathy, E. Pinheiro, and M.L. Scott, “InterWeave: A middleware system for distributed shared state,” in Fifth Workshop on Languages, Compilers, and Runtime Systems (LCR) 2000, Rochester, NY, May 2000.
10. D. Chen, C. Tang, X. Chen, S. Dwarkadas, and M. Scott, “Beyond S-DSM: Shared state for distributed systems,” URCS Technical Report 744, University of Rochester, March 2001.
11. D.E. Culler and J.P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann: San Mateo, CA, 1999.
12. G. Das, H. Mannila, and P. Ronkainen, “Similarity of attributes by external probes,” in Proceedings of the 4th Symposium on Knowledge Discovery and Data-Mining, 1998.
13. L. Devroye, A Course in Density Estimation, Birkhauser: Boston, MA, 1987.
14. P. Dewan and J. Riedl, “Towards computer-supported concurrent software engineering,” IEEE Computer, vol. 26, no. 1, 1993.
15. A. Dingle and T. Partl, “Web cache coherence,” in Proceedings of the 5th WWW Conference (journal version: IJCN), 1997.
16. G. Fitzpatrick, S. Kaplan, and W. Tollone, “Work, locales and distributed social worlds,” in European Conference on Computer Supported Collaborative Work, 1995.
17. M.J. Franklin, Client Data Caching: A Foundation for High Performance Object Database Systems, Kluwer Academic Publishers: Dordrecht, 1996.
18. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, “Memory consistency and event ordering in scalable shared-memory multiprocessors,” in Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990, pp. 15–26.
19. R. Grossman, S. Bailey, S. Kasif, D. Mon, A. Ramu, and B. Malhi, “Design of Papyrus: A system for high performance, distributed data mining over clusters, meta-clusters and super-clusters,” in Proceedings of the Workshop on Distributed Data Mining, along with KDD98, Aug. 1998.
20. D. Grunwald, B. Zorn, and R. Henderson, “Improving the cache locality of memory allocation,” in ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 1993, pp. 177–186.
21. Y. Guo, S. Rueger, J. Sutiwaraphun, and J. Forbes-Millot, “Meta-learning for parallel data mining,” in Proceedings of the Seventh Parallel Computing Workshop, 1997.
22. Y. Huang, R. Sloan, and O. Wolfson, “Divergence caching in client-server architectures,” in IEEE Conf. on Parallel and Distributed Information Systems, 1994.
23. L. Iftode, C. Dubnicki, E.W. Felten, and K. Li, “Improving release-consistent shared virtual memory using automatic update,” in High Performance Computer Architecture, pp. 14–25, Feb. 1996.
24. K.L. Johnson, M.F. Kaashoek, and D.A. Wallach, “CRL: High-performance all-software distributed shared memory,” in Proceedings of the 15th ACM Symposium on Operating Systems Principles, Dec. 1995, pp. 213–228.
25. A.D. Joseph, A.F. deLespinasse, J.A. Tauber, D.K. Gifford, and M.F. Kaashoek, “Rover: A toolkit for mobile information access,” in 15th SOSP, Dec. 1995.
26. E. Jul, H. Levy, N. Hutchinson, and A. Black, “Fine-grained mobility in the Emerald system,” ACM Transactions on Computer Systems, vol. 6, no. 1, pp. 109–133, 1988.
27. H. Kargupta, B. Park, D. Hershberger, and E. Johnson, “Collective data mining: A new perspective toward distributed data analysis,” in Advances in Distributed and Parallel Knowledge Discovery, Kargupta and Chan (Eds.), 2000.
28. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo, “Finding interesting rules from large sets of discovered association rules,” in 3rd Intl. Conf. Information and Knowledge Management, Nov. 1994, pp. 401–407.
29. L. Kontothanasis, G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. Scott, “VM-based shared memory on low-latency, remote-memory-access networks,” in Proc. of the 24th ISCA, June 1997.
30. B. Liskov, A. Adya, M. Castro, M. Day, S. Ghemawat, R. Gruber, U. Maheshwari, A. Meyers, and L. Shrira, “Safe and efficient sharing of persistent objects in Thor,” in SIGMOD, 1996.
31. C. Mercer, S. Savage, and H. Tokuda, “Processor capacity reserves: Operating system support for multimedia applications,” in Proceedings of the IEEE International Conference on Multimedia Computing and Systems, 1994.
32. K. Nahrstedt, H. Chu, and S. Narayan, “QoS-aware resource management for distributed multimedia applications,” Journal on High-Speed Networking, vol. 8, no. 3/4, pp. 227–255, 1998, IOS Press.
33. S. Parthasarathy, “Active data mining in a distributed setting,” PhD Dissertation, University of Rochester, 1999.
34. S. Parthasarathy, “Towards network-aware data mining,” in International Workshop on Parallel and Distributed Data Mining, along with IPDPS 2001.
35. S. Parthasarathy and S. Dwarkadas, “InterAct: Virtual sharing for interactive client-server applications,” in Fourth Workshop on Languages, Compilers, and Runtime Systems (LCR), May 1998.
36. S. Parthasarathy and S. Dwarkadas, “Shared state for client server mining,” in SIAM International Conference on Data Mining, 2001.
37. S. Parthasarathy, S. Dwarkadas, and M. Ogihara, “Active mining in a distributed setting,” in Workshop on Parallel and Distributed KDD Systems, 1999.
38. S. Parthasarathy and M. Ogihara, “Clustering homogeneous distributed datasets,” in Fourth Practical Applications of Knowledge Discovery and Data Mining (PKDD), 2000.
39. S. Parthasarathy, R. Subramonian, and R. Venkata, “Generalized discretization for summarization and classification,” in PADD, Jan. 1998.
40. S. Parthasarathy, M. Zaki, and W. Li, “Memory placement techniques for parallel association mining,” in Proceedings of the 4th Symposium on Knowledge Discovery and Data-Mining, 1998.
41. S. Parthasarathy, M. Zaki, M. Ogihara, and S. Dwarkadas, “Incremental and interactive sequence mining,” in ACM Conference on Information and Knowledge Management, 1999.
42. E. Pinheiro, D. Chen, S. Dwarkadas, S. Parthasarathy, and M.L. Scott, “S-DSM for heterogeneous machine architectures,” in Second Workshop on Software Distributed Shared Memory, Santa Fe, NM, May 2000.
43. M. Satyanarayanan and D. Narayanan, “Multi-fidelity algorithms for interactive mobile applications,” in Third International Workshop on Discrete Algorithms and Methods in Mobile Computing and Communications, Seattle, WA, Aug. 1999.
44. I. Schoinas, B. Falsafi, A.R. Lebeck, S.K. Reinhardt, J.R. Larus, and D.A. Wood, “Fine-grain access control for distributed shared memory,” in Proceedings of the 6th Symposium on Architectural Support for Programming Languages and Operating Systems, Oct. 1994, pp. 297–306.
45. M. Shapiro, S. Kloosterman, and F. Riccardi, “PerDiS: A persistent distributed store for cooperative applications,” in Proceedings of the Third Cabernet Plenary Workshop, Rennes, France, April 1997.
46. A. Singla, U. Ramachandran, and J. Hodgins, “Temporal notions of synchronization and consistency in Beehive,” in Proc. of the 9th SPAA, June 1997.
47. R. Srinivasan, C. Liang, and K. Ramamritham, “Maintaining temporal coherency of virtual data warehouses,” in IEEE Real-Time Systems Symposium (RTSS98), Dec. 1998.
48. D. Sriram, R. Logcher, N. Groleau, and J. Chernoff, “Dice: An object oriented programming environment for cooperative engineering design,” AI in Engineering Design, vol. 3, Academic Press, 1992.
49. R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott, “Cashmere-2L: Software coherent shared memory on a clustered remote-write network,” in Symposium on Operating Systems Principles, Oct. 1997.
50. R. Subramonian and S. Parthasarathy, “A framework for distributed data mining,” in Proceedings of the Workshop on Distributed Data Mining, along with KDD98, Aug. 1998.
51. D.B. Terry, M.M. Theimer, K. Peterson, A.J. Demers, M.J. Spreitzer, and C.H. Hauser, “Managing update conflicts in Bayou, a weakly connected replicated storage system,” in 15th SOSP, Dec. 1995.
52. S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka, “Incremental updation of association rules,” in KDD97, Aug. 1997.
53. M. vanSteen, P. Homburg, and A.S. Tanenbaum, “The architectural design of Globe: A wide-area distributed system,” Technical Report (Vrije University) IR-431, March 1997.
54. P.R. Wilson, “Pointer swizzling at page fault time: Efficiently and compatibly supporting huge address spaces on standard hardware,” in International Workshop on Object Orientation in Operating Systems, Sept. 1992.
55. M.J. Zekauskas, W.A. Sawdon, and B.N. Bershad, “Software write detection for distributed shared memory,” in Proceedings of the First USENIX Symposium on Operating System Design and Implementation, Nov. 1994, pp. 87–100.


Distributed and Parallel Databases, 11, 157–180, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets∗

NAGIZA F. SAMATOVA [email protected]
GEORGE OSTROUCHOV [email protected]
AL GEIST [email protected]
Computer Science and Mathematics Division, Oak Ridge National Laboratory,† P.O. Box 2008, Oak Ridge, TN 37831, USA

ANATOLI V. MELECHKO [email protected]
Engineering and Nanoscale Technologies Group, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831, USA

Abstract. This paper presents a hierarchical clustering method named RACHET (Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic) for analyzing multi-dimensional distributed data. A typical clustering algorithm requires bringing all the data into a centralized warehouse. This results in O(nd) transmission cost, where n is the number of data points and d is the number of dimensions. For large datasets, this is prohibitively expensive. In contrast, RACHET runs with at most O(n) time, space, and communication costs to build a global hierarchy of comparable clustering quality by merging locally generated clustering hierarchies. RACHET employs the encircling tactic in which the merges at each stage are chosen so as to minimize the volume of a covering hypersphere. For each cluster centroid, RACHET maintains descriptive statistics of constant complexity to enable these choices. RACHET’s framework is applicable to a wide class of centroid-based hierarchical clustering algorithms, such as centroid, medoid, and Ward.

Keywords: clustering distributed datasets, distributed data mining

1. Introduction

Clustering of multidimensional data is a critical step in many fields including data mining [6], statistical data analysis [1, 12], pattern recognition and image processing [7], and business applications [2]. Hierarchical clustering based on a dissimilarity measure is perhaps the most common form of clustering. It is an iterative process of merging (agglomeration) or splitting (partition) of clusters that creates a tree structure called a dendrogram from a set of data points. Centroid-based hierarchical clustering algorithms, such as centroid, medoid, or minimum variance [1], define the dissimilarity metric between two clusters as some function (e.g., Lance-Williams [13]) of distances between cluster centers. Euclidean distance is typically used.

∗This work has been supported by the MICS Division of the US Department of Energy.
†Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the U.S. D.O.E. under Contract No. DE-AC05-00OR22725.


We focus on the distributed hierarchical clustering problem. We create a hierarchical decomposition of clusters of massive data sets inherently distributed among various sites connected by a network. For practical reasons, the application to distributed and very massive (both in terms of data points and the number of features, or dimensions, for each point) datasets raises a number of major requirements for any solution to this problem:

1. Qualitative comparability. The quality of the hierarchical clustering system produced by the distributed approach should be comparable to the quality of the clustering hierarchy generated from centralized data.

2. Computational complexity reduction. Asymptotic time and space complexity of a distributed algorithm should be less than or equal to the asymptotic complexity of the corresponding centralized approach.

3. Scalability. The algorithms should be scalable with the number of data points, the number of features, and the number of data stores.

4. Communication acceptability. The data transfer/communication overheads should be modest. Doing this with minimal communication of data is a challenge.

5. Flexibility. If the solution is based on an existing clustering algorithm, then it should be applicable to a wide class of clustering algorithms.

6. Visual representation sufficiency. The summarized description of the resulting global hierarchical cluster structure should be sufficient for its accurate visual representation.

Current clustering approaches do not offer a solution to the distributed hierarchical clustering problem that meets all these requirements. Most clustering approaches [3, 9, 14] are restricted to the centralized data situation that requires bringing all the data together in a single, centralized warehouse. For large datasets, the transmission cost becomes prohibitive. Even if the data could be centralized, clustering massive centralized data is not feasible in practice using existing algorithms and hardware.

Distributed clustering approaches necessarily depend on how the data are distributed. Possible combinations are: vertical (features), horizontal (data points), and block fragmentations. For vertically distributed data sets, Johnson and Kargupta [10] proposed the Collective Hierarchical Clustering (CHC) algorithm for generating hierarchical clusters. The CHC runs with an O(|S|n) space and O(n) communication requirement, where n is the number of data points and |S| is the number of data sites. Its time complexity for the agglomeration phase is O(|S|n^2), and the implementation is restricted to single link clustering [1], also referred to as nearest neighbor clustering. This does not include the complexity for generating local hierarchies. Parallel hierarchical clustering approaches [4, 15] can be considered as a special case of horizontal data distribution. However, these algorithms are tailored to a specific hardware architecture (e.g., PRAM) or restricted to a certain number of processors. Moreover, there is a major distinction between parallel and horizontally distributed approaches: the data are already distributed, so we do not have the luxury of distributing data for optimal algorithm performance as is often done for parallel computation.

We present a clustering algorithm named RACHET that is especially suitable for very large, high-dimensional, and horizontally distributed datasets. RACHET builds a global hierarchy by merging clustering hierarchies generated locally at each of the distributed data sites. Its time, space, and transmission costs are at most O(n) (linear) in the size of the dataset. This includes only the complexity of the transmission and agglomeration phases and does not include the complexity of generating local clustering hierarchies. Finally, RACHET’s summarized description of the global clustering hierarchy is sufficient for its accurate visual representation that maximally preserves the proximity between data points.

The rest of the paper is organized as follows. Section 2 describes the details of the RACHET algorithm. It first introduces the concept of descriptive statistics for cluster centroids and then derives an approximation to the Euclidean metric based on this concept. The description of the process of merging two clustering hierarchies concludes this section. Section 3 presents time/space/transmission cost analysis of the RACHET algorithm. Section 4 provides some empirical results on real and synthetic datasets. Error of the approximation to the Euclidean metric is discussed in Section 5. Finally, prospects for future work conclude this paper.

1.1. Definitions and notation

Here, we introduce notation and definitions used throughout the paper. Let n denote the total number of data points, d denote the dimensionality of the data space, and |S| denote the number of data sites. First, we provide a formulation of the distributed hierarchical clustering problem.

Definition 1. The dendrogram [5] is a tree-like representation of the result of a hierarchical clustering algorithm. It can be viewed as a sequence of partitions of the data into clusters, beginning with the whole data set at the root node of the tree at the top.

Problem Definition. The distributed hierarchical clustering problem is the problem of creating a global hierarchical decomposition into clusters (represented by a dendrogram) of a data set distributed across various data sites connected by a network. More formally,

Given:
1. n data objects with d features each,
2. a distribution of these data objects across S = {S1, S2, . . . , S|S|} data sites (a horizontal distribution), and
3. a set D = {D1, D2, . . . , D|S|} of local hierarchical decompositions (or local dendrograms) of clusters of data objects in Si, i = 1, . . . , |S|.

Find: A global hierarchical decomposition (or global dendrogram) of clusters of the n data objects, such that the global dendrogram generated from the |S| data sites is as similar as possible to the dendrogram generated from the centralized dataset of n data objects.

For a horizontally distributed case, the ideal creation of a global dendrogram should fulfill the following requirements:

1. It should require minimum data transfer across the network: O(n) or O(n log n) but not O(nd) or higher, because the communication cost will be prohibitive for high-dimensional datasets.

2. It should be fast to merge local dendrograms: O(n) or O(n log n) but not O(n^2) or higher, because time and space cost will be too high for massive datasets.

3. It should be of a comparable quality relative to the centralized dendrogram.

Next, we will define the Descriptive Statistics, or summarized cluster representation. This is one of the key concepts of this paper. Let $C = \{\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_{N_c}\} \subset R^d$ be the set of data points in a cluster.

Definition 2. The cluster centroid $\vec{c} = (f_{c1}, f_{c2}, \ldots, f_{cd})$ is the mean vector of all the data points in the cluster. Hence, the centroid of cluster C is defined as:

$$\vec{c} = \frac{1}{N_c} \sum_{i=1}^{N_c} \vec{p}_i \qquad (1)$$

Let $p_{ij}$ denote the j-th component of the data point $\vec{p}_i$. Thus, the j-th component of $\vec{c}$ is given by:

$$f_{cj} = \frac{1}{N_c} \sum_{i=1}^{N_c} p_{ij}, \quad j = 1, \ldots, d. \qquad (2)$$

Definition 3. The radius of the cluster $R_c$ is defined as the square root of the average squared Euclidean distance of a point from the centroid of the cluster. More formally, $R_c$ is given by

$$R_c := \left[ \frac{\sum_{i=1}^{N_c} (\vec{p}_i - \vec{c})^2}{N_c} \right]^{1/2} \qquad (3)$$

Definition 4. The covering hypersphere $(\vec{c}, R_c)$ of the cluster C is defined as the hypersphere with the center $\vec{c}$ and the radius $R_c$. Each cluster C can be represented by a covering hypersphere $(\vec{c}, R_c)$. In what follows, the terms “cluster” and its “hypersphere” will be used interchangeably throughout the paper.

Selection and effective description of cluster Descriptive Statistics (DS), or summarized cluster representation, is an important step in merging local clustering hierarchies and in visualization of the global hierarchy. DS have to meet a number of major requirements:

• They should occupy much less space than the naive representation, which maintains all objects in a cluster.

• They should be adequate for efficiently calculating all measurements involved in making clustering decisions such as merging or reconfiguration.

• They should be sufficient to visually represent the global hierarchy.

Definition 5. The Descriptive Statistics (DS) of the cluster centroid $\vec{c}$ are defined as a 6-tuple $DS(\vec{c}) = (N_c, NORMSQ_c, R_c, SUM_c, MIN_c, MAX_c)$, where


1. $N_c$ is the number of data points in the cluster.
2. $NORMSQ_c$ is the square norm of the centroid, defined as:

$$NORMSQ_c := \sum_{j=1}^{d} f_{cj}^2 \qquad (4)$$

3. $R_c$ is the radius of the cluster.
4. $SUM_c$ is the sum of the components of the centroid, scaled by $N_c$, defined as:

$$SUM_c := N_c \sum_{j=1}^{d} f_{cj} \qquad (5)$$

5. $MIN_c$ is the minimum value of the centroid components, scaled by $N_c$, defined as:

$$MIN_c := N_c \min_{1 \leq j \leq d} f_{cj} \qquad (6)$$

6. $MAX_c$ is the maximum value of the centroid components, scaled by $N_c$, defined as:

$$MAX_c := N_c \max_{1 \leq j \leq d} f_{cj} \qquad (7)$$
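The 6-tuple can be computed in a single pass over a cluster's points. The following Python sketch illustrates Definitions 2-5 under our own naming conventions; it is not the authors' code (RACHET itself maintains the DS incrementally inside the local clustering algorithm), and the NumPy dependency is our choice.

import numpy as np

def descriptive_statistics(points):
    # points: array of shape (N_c, d) holding the raw data points of one cluster.
    points = np.asarray(points, dtype=float)
    n_c = points.shape[0]
    centroid = points.mean(axis=0)                                               # Eqs. (1)-(2)
    normsq = float(np.sum(centroid ** 2))                                        # Eq. (4)
    radius = float(np.sqrt(np.mean(np.sum((points - centroid) ** 2, axis=1))))   # Eq. (3)
    sum_c = n_c * float(centroid.sum())                                          # Eq. (5)
    min_c = n_c * float(centroid.min())                                          # Eq. (6)
    max_c = n_c * float(centroid.max())                                          # Eq. (7)
    return (n_c, normsq, radius, sum_c, min_c, max_c)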

Finally, we define some other notations used for building a global dendrogram:

1. $d^2(\vec{c}_1, \vec{c}_2)$ is the squared Euclidean distance between two cluster centroids, $\vec{c}_1$ and $\vec{c}_2$. It is given by:

$$d^2(\vec{c}_1, \vec{c}_2) = \sum_{j=1}^{d} \left( f_{c_1 j} - f_{c_2 j} \right)^2. \qquad (8)$$

2. $d^2_{approx}(\vec{c}_1, \vec{c}_2)$ is the approximation to the squared Euclidean distance. It is defined by Eq. (19). $d_{approx}(\vec{c}_1, \vec{c}_2)$ denotes the square root of $d^2_{approx}(\vec{c}_1, \vec{c}_2)$.
3. NN(i) is the nearest neighbor of the i-th object.
4. DISS(i) is the value of dissimilarity (e.g., Euclidean distance) between the i-th object and its nearest neighbor.

2. The RACHET algorithm

We assume the data are distributed across several sites where each site has the same set of features but on different items. Note that this is a horizontal distribution of the data. Homogeneity is assumed not only for the type of features of the problem domain but also for the units of measurements of those features. Next, we use Euclidean distance as the measure of dissimilarity between individual points. Finally, the implementation of RACHET assumes a centroid-based hierarchical clustering algorithm, such as centroid, medoid, or minimum variance (Ward's).


Figure 1. Control flow of RACHET.

An overview of these hierarchical clustering methods can be found in [1].

Figure 1 presents the control flow of RACHET. Phase 1 is designed to generate a local dendrogram from each of the distributed data sites using a given off-the-shelf centroid-based hierarchical clustering algorithm. For each node of the dendrogram, RACHET maintains Descriptive Statistics (DS) about the cluster at that node. The space complexity of the DS is constant. This summary is not only efficient because it requires much less space than storing all the data points in the cluster, but also effective because it is sufficient for calculating all measurements involved in making clustering decisions in consecutive phases. More details on DS are presented in Section 2.1. We adapted Ward's agglomerative hierarchical clustering algorithm to generate and maintain the DS.

After Phase 1, we obtain a list of dendrograms that captures the major information about each cluster centroid needed for making clustering decisions in Phase 3. In Phase 2, local dendrograms are transmitted to a single merger site. The agglomeration of these dendrograms is performed at the merger site.

Phase 3 is the core of RACHET. The main task of Phase 3 is to cluster local dendrograms into a global dendrogram. We adapted an agglomerative hierarchical algorithm for clustering data points [15] by applying it directly to local dendrograms. The algorithm is shown in figure 2. One of the key components in this algorithm is the call of the merge-dendrograms( ) method that merges two dendrograms. The details of this method are discussed in Section 2.3. Due to the lack of space, we omit description of the find best match( ) method. Its pseudo code is available at http://www.csm.ornl.gov/∼ost/RACHET.
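As a rough illustration of Phase 3, the Python sketch below mirrors the agglomeration loop of figure 2. It assumes each local dendrogram root carries its DS tuple in a .ds attribute and takes the distance approximation and the pairwise merge as callables (Sections 2.2 and 2.3). For brevity it rescans all pairs in every iteration instead of maintaining the NN(i) and DISS(i) arrays used in the actual algorithm, so it sketches the control flow only; all names are ours.

def build_global_dendrogram(local_roots, d_approx, merge_dendrograms):
    # local_roots: list of |S| local dendrogram roots, each with a .ds attribute.
    roots = list(local_roots)
    while len(roots) > 1:
        # Find the two dendrograms whose root centroids are closest under the
        # approximate (DS-only) Euclidean distance.
        best = None
        for i in range(len(roots)):
            for j in range(i + 1, len(roots)):
                dist = d_approx(roots[i].ds, roots[j].ds)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        # Agglomerate the closest pair into a single dendrogram and put it back.
        merged = merge_dendrograms(roots[i], roots[j])
        roots = [r for k, r in enumerate(roots) if k not in (i, j)] + [merged]
    return roots[0]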

2.1. Centroid descriptive statistics

Given the descriptive statistics (see Definition 5) of two cluster centroids $\vec{c}_1$ and $\vec{c}_2$, this section provides a mechanism for updating the descriptive statistics of the cluster $\vec{c}$ formed by merging clusters $\vec{c}_1$ and $\vec{c}_2$ without regenerating them from “scratch”.


Figure 2. An efficient algorithm to build a global dendrogram.

Theorem 2.1. Assume that $DS(\vec{c}_1) = (N_{c_1}, NORMSQ_{c_1}, R_{c_1}, SUM_{c_1}, MIN_{c_1}, MAX_{c_1})$ and $DS(\vec{c}_2) = (N_{c_2}, NORMSQ_{c_2}, R_{c_2}, SUM_{c_2}, MIN_{c_2}, MAX_{c_2})$ are the descriptive statistics of two disjoint clusters $\vec{c}_1$ and $\vec{c}_2$, respectively. Then the following statements hold for the descriptive statistics of cluster $\vec{c}$ that is formed by merging these clusters:

1. $N_c = N_{c_1} + N_{c_2}$
2. $NORMSQ_c = \frac{1}{N_{c_1} + N_{c_2}} \left\{ N_{c_1} NORMSQ_{c_1} + N_{c_2} NORMSQ_{c_2} - \frac{N_{c_1} N_{c_2}}{N_{c_1} + N_{c_2}} d^2(\vec{c}_1, \vec{c}_2) \right\}$, where $d^2(\vec{c}_1, \vec{c}_2)$ is the squared Euclidean distance between the two centroids.
3. $R_c = \left[ \frac{1}{N_{c_1} + N_{c_2}} \left\{ N_{c_1} R_{c_1}^2 + N_{c_2} R_{c_2}^2 + \frac{N_{c_1} N_{c_2}}{N_{c_1} + N_{c_2}} d^2(\vec{c}_1, \vec{c}_2) \right\} \right]^{\frac{1}{2}}$
4. $SUM_c = SUM_{c_1} + SUM_{c_2}$
5. $MIN_c \geq MIN_{c_1} + MIN_{c_2}$
6. $MAX_c \leq MAX_{c_1} + MAX_{c_2}$

Proof: In order to evaluate the square norm of centroid $\vec{c}$ that is formed by merging disjoint clusters $\vec{c}_1$ and $\vec{c}_2$, we first note that based on Eq. (2) the j-th component of $\vec{c}$ can be defined by the relation

$$N_c f_{cj} = N_{c_1} f_{c_1 j} + N_{c_2} f_{c_2 j}$$

Squaring both sides of this equation gives

$$N_c^2 f_{cj}^2 = N_{c_1}^2 f_{c_1 j}^2 + N_{c_2}^2 f_{c_2 j}^2 + 2 N_{c_1} N_{c_2} f_{c_1 j} f_{c_2 j} \qquad (9)$$

The cross-product term can be written as

$$2 f_{c_1 j} f_{c_2 j} = f_{c_1 j}^2 + f_{c_2 j}^2 - \left( f_{c_1 j} - f_{c_2 j} \right)^2 \qquad (10)$$


Substituting Eq. (10) into Eq. (9) and dividing both sides by $N_c^2$ then gives

$$f_{cj}^2 = \frac{1}{N_{c_1} + N_{c_2}} \left\{ N_{c_1} f_{c_1 j}^2 + N_{c_2} f_{c_2 j}^2 - \frac{N_{c_1} N_{c_2}}{N_{c_1} + N_{c_2}} \left( f_{c_1 j} - f_{c_2 j} \right)^2 \right\}$$

Summing both sides of this equation over j results in

$$\sum_{j=1}^{d} f_{cj}^2 = \frac{1}{N_{c_1} + N_{c_2}} \left\{ N_{c_1} \sum_{j=1}^{d} f_{c_1 j}^2 + N_{c_2} \sum_{j=1}^{d} f_{c_2 j}^2 - \frac{N_{c_1} N_{c_2}}{N_{c_1} + N_{c_2}} \sum_{j=1}^{d} \left( f_{c_1 j} - f_{c_2 j} \right)^2 \right\} \qquad (11)$$

This proves the update formula for $NORMSQ_c$.

From the definition of cluster centroid $\vec{c}$, it follows that its squared radius can be written as:

$$N_c R_c^2 = \sum_{i=1}^{N_c} \sum_{j=1}^{d} \left( p_{ij} - f_{cj} \right)^2 = \sum_{j=1}^{d} \sum_{i=1}^{N_c} p_{ij}^2 - N_c \sum_{j=1}^{d} f_{cj}^2 \qquad (12)$$

If cluster $\vec{c}$ is formed by merging two disjoint clusters $\vec{c}_1$ and $\vec{c}_2$, then the first term in Eq. (12) can be decomposed into the sum of squared coordinates of data points in the first cluster and the sum of squared coordinates of points in the second cluster. That is,

$$\sum_{j=1}^{d} \sum_{i=1}^{N_c} p_{ij}^2 = \sum_{j=1}^{d} \sum_{i=1}^{N_{c_1}} p_{ij}^2 + \sum_{j=1}^{d} \sum_{i=1}^{N_{c_2}} p_{ij}^2 \qquad (13)$$

Substituting Eqs. (11) and (13) into Eq. (12) and regrouping the terms then gives

$$N_c R_c^2 = \left( \sum_{j=1}^{d} \sum_{i=1}^{N_{c_1}} p_{ij}^2 - N_{c_1} \sum_{j=1}^{d} f_{c_1 j}^2 \right) + \left( \sum_{j=1}^{d} \sum_{i=1}^{N_{c_2}} p_{ij}^2 - N_{c_2} \sum_{j=1}^{d} f_{c_2 j}^2 \right) + \frac{N_{c_1} N_{c_2}}{N_{c_1} + N_{c_2}} \sum_{j=1}^{d} \left( f_{c_1 j} - f_{c_2 j} \right)^2$$

Applying Eq. (12) to clusters $\vec{c}_1$ and $\vec{c}_2$, the last equation can be written as

$$N_c R_c^2 = N_{c_1} R_{c_1}^2 + N_{c_2} R_{c_2}^2 + \frac{N_{c_1} N_{c_2}}{N_{c_1} + N_{c_2}} d^2(\vec{c}_1, \vec{c}_2)$$

This proves the update formula for $R_c$.

To derive the lower bound on $MIN_c$ of the centroid $\vec{c}$, we note that each component $j$ of $\vec{c}$ can be represented as

$$N_c f_{cj} = \sum_{i=1}^{N_c} p_{ij} = \sum_{i=1}^{N_{c_1}} p_{ij} + \sum_{i=1}^{N_{c_2}} p_{ij} = N_{c_1} f_{c_1 j} + N_{c_2} f_{c_2 j} \qquad (14)$$


By definition of MINc1 and MINc2 , it follows that

$$f_{c_1 j} \geq \frac{1}{N_{c_1}} MIN_{c_1} \quad \text{and} \quad f_{c_2 j} \geq \frac{1}{N_{c_2}} MIN_{c_2} \qquad \text{for } j = 1, \ldots, d.$$

Hence, Eq. (14) can be estimated as

Nc fcj ≥ MINc1 + MINc2

Taking the minimum from both sides of this inequality over $j$ proves the lower bound for $MIN_c$. The update formulas for the other parameters of $DS(\vec{c})$ can be proven similarly. ✷
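In code, the theorem translates into a constant-time update of the merged cluster's summary. The Python sketch below is our own illustration; for items 5 and 6 the theorem only gives bounds, so the sketch simply stores those bounds as the merged MIN/MAX values, which is one admissible choice rather than something prescribed by the paper.

import math

def merge_ds(ds1, ds2, d2):
    # ds1, ds2: DS tuples (N, NORMSQ, R, SUM, MIN, MAX) of two disjoint clusters;
    # d2: (approximate) squared Euclidean distance between their centroids.
    n1, normsq1, r1, sum1, min1, max1 = ds1
    n2, normsq2, r2, sum2, min2, max2 = ds2
    n = n1 + n2                                                                 # item 1
    normsq = (n1 * normsq1 + n2 * normsq2 - (n1 * n2 / n) * d2) / n             # item 2
    radius = math.sqrt((n1 * r1 ** 2 + n2 * r2 ** 2 + (n1 * n2 / n) * d2) / n)  # item 3
    return (n,
            normsq,
            radius,
            sum1 + sum2,       # item 4
            min1 + min2,       # item 5 (lower bound used as the stored value)
            max1 + max2)       # item 6 (upper bound used as the stored value)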

2.2. Euclidean distance approximation

From Eq. (8), it follows that computing the Euclidean distance between centroids from different local datasets would require the transmission of all d centroid components. This approach would involve the transmission of the cluster centroids represented by each node of the dendrogram generated at each of the |S| local datasets. This would result in a transmission cost of O(nd), which can be prohibitively high.

Given the DS of each cluster, we can derive an approximate distance between the two cluster centroids. Equation (8) can be expanded as follows:

$$d^2(\vec{c}_1, \vec{c}_2) = \sum_{j=1}^{d} f_{c_1 j}^2 + \sum_{j=1}^{d} f_{c_2 j}^2 - 2 \sum_{j=1}^{d} f_{c_1 j} f_{c_2 j} \qquad (15)$$

$$d^2(\vec{c}_1, \vec{c}_2) = NORMSQ_{c_1} + NORMSQ_{c_2} - 2 \sum_{j=1}^{d} f_{c_1 j} f_{c_2 j} \qquad (16)$$

If the cross-product term is ignored, then the distance can be approximated by the sum of square norms of the centroids. This results in a significant error. To reduce this error, we can place a non-zero upper and lower bound on the cross-product term:

$$\frac{1}{N_{c_1} N_{c_2}} MIN_{c_1} SUM_{c_2} \leq \sum_{j=1}^{d} f_{c_1 j} f_{c_2 j} \leq \frac{1}{N_{c_1} N_{c_2}} MAX_{c_1} SUM_{c_2} \qquad (17)$$

or

$$\frac{1}{N_{c_1} N_{c_2}} MIN_{c_2} SUM_{c_1} \leq \sum_{j=1}^{d} f_{c_1 j} f_{c_2 j} \leq \frac{1}{N_{c_1} N_{c_2}} MAX_{c_2} SUM_{c_1} \qquad (18)$$

Inequalities (17) and (18) hold if each component of the cluster centroid is positive, i.e., $f_{cj} > 0$ for all $c$ and $j = 1, \ldots, d$. Otherwise, for $O(|S|)$ communication cost, we can broadcast


the global constant CONST such that

$$p_{ij}^{new} = p_{ij}^{old} + CONST > 0$$

for each component $j$ of the data point $\vec{p}_i$. Taking the maximum of the lower bounds and the minimum of the upper bounds in (17) and (18) leads to the following bounds on the Euclidean distance:

$$d^2_{lower}(\vec{c}_1, \vec{c}_2) = \max\left\{ 0,\; NORMSQ_{c_1} + NORMSQ_{c_2} - 2 \frac{1}{N_{c_1} N_{c_2}} \min\left\{ MAX_{c_1} SUM_{c_2},\, MAX_{c_2} SUM_{c_1} \right\} \right\}$$

$$d^2_{upper}(\vec{c}_1, \vec{c}_2) = NORMSQ_{c_1} + NORMSQ_{c_2} - 2 \frac{1}{N_{c_1} N_{c_2}} \max\left\{ MIN_{c_1} SUM_{c_2},\, MIN_{c_2} SUM_{c_1} \right\}$$

Taking the simple mean of the minimum and the maximum square distances gives an approximation of the squared Euclidean distance between two centroids:

$$d^2_{approx}(\vec{c}_1, \vec{c}_2) = \frac{d^2_{lower} + d^2_{upper}}{2} \qquad (19)$$
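The bounds and Eq. (19) can be evaluated from two DS tuples alone, without exchanging any centroid coordinates. The following Python sketch (our illustration, with our own function names) assumes the positivity condition on the centroid components discussed above.

def d2_approx(ds1, ds2):
    n1, normsq1, _, sum1, min1, max1 = ds1
    n2, normsq2, _, sum2, min2, max2 = ds2
    s = normsq1 + normsq2
    scale = 1.0 / (n1 * n2)
    # Bounds on the cross-product term from inequalities (17) and (18).
    cross_upper = scale * min(max1 * sum2, max2 * sum1)
    cross_lower = scale * max(min1 * sum2, min2 * sum1)
    d2_lower = max(0.0, s - 2.0 * cross_upper)
    d2_upper = s - 2.0 * cross_lower
    return 0.5 * (d2_lower + d2_upper)      # Eq. (19)

def d_approx(ds1, ds2):
    return d2_approx(ds1, ds2) ** 0.5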

2.3. Merging two dendrograms

Given two datasets S1 and S2 and their dendrograms D1 and D2 generated by a hierarchical clustering algorithm applied locally to each data set, figure 3 illustrates four different cases (out of six possible) of merging the two dendrograms (figure 3(a)) into dendrogram Dnew.

Case 1 (figure 3(b)). This case is designed to merge two well separated datasets. Two clusters, $\vec{c}_1$ and $\vec{c}_2$, are well separated if their hyperspheres do not intersect. That is,

$$d(\vec{c}_1, \vec{c}_2) \geq R_{c_1} + R_{c_2}.$$

In this case, a new parent node, Dnew, is created and dendrograms D1 and D2 become the children of the new node. The descriptive statistics $DS(\vec{c}_{new})$ of the new cluster are updated according to Theorem 2.1.

Case 2. Here, the data points of the first cluster are contained in the hypersphere with center $\vec{c}_2$ and radius $R_{c_2}$, i.e.

$$d(\vec{c}_1, \vec{c}_2) < R_{c_2}.$$

This case is further subdivided into two subcases:


Figure 3. Four cases of merging two dendrograms. (a) Two dendrograms D1 and D2. (b) Merging two well separated clusters (Case 1). (c) Making cluster S1 a subcluster of cluster S2 provided a proper containment of cluster S1 in cluster S2 (Case 2a). (d) Merging cluster S1 with the best matched subcluster of cluster S2 provided a proper containment of cluster S1 in cluster S2 (Case 2b). (e) Merging two overlapping clusters (Case 4).

Case 2a (figure 3(c)). The first cluster $(\vec{c}_1, R_{c_1})$ is well separated from every child cluster $(\vec{c}_{2j}, R_{c_{2j}})$ of the second cluster $(\vec{c}_2, R_{c_2})$, j = 1, 2, . . .. In this case, dendrogram D1 becomes a new child of dendrogram D2. The descriptive statistics $DS(\vec{c}_{new})$ are updated similarly to Case 1.

Case 2b (figure 3(d)). The first cluster $(\vec{c}_1, R_{c_1})$ overlaps with one or more child clusters $(\vec{c}_{2j}, R_{c_{2j}})$ of the second cluster $(\vec{c}_2, R_{c_2})$, j = 1, 2, . . .. Here the child cluster that matches best with dendrogram D1 is selected to be merged with this dendrogram using a recursive call to the merge-dendrograms( ) process. There are a number of possible choices for defining a “best match”. One choice for the best match is the cluster that has the largest intersection volume with the candidate cluster. The new node that is returned by the merge-dendrograms( ) process replaces the selected child in dendrogram D2. If the new node Dnew has more than two children, then its descriptive statistics are obtained by repeatedly applying Theorem 2.1 to two children at a time.

Case 3. This case addresses the situation when data points of the second cluster are contained in the hypersphere with center $\vec{c}_1$ and radius $R_{c_1}$, i.e.

$$d(\vec{c}_1, \vec{c}_2) < R_{c_1}.$$

This case is a degenerate example of Case 2.


Case 4 (figure 3(e)). This last case is designed to merge partially overlapped clusters, i.e.

$$\left( d(\vec{c}_1, \vec{c}_2) < R_{c_1} + R_{c_2} \right) \quad \text{and} \quad \left( d(\vec{c}_1, \vec{c}_2) > R_{c_1} \ \text{or} \ d(\vec{c}_1, \vec{c}_2) > R_{c_2} \right).$$

This case tries to improve the quality of the clustering by reconfiguring the children of both dendrograms D1 and D2. First, a new parent node, Dnew, with D1 and D2 as its children is created and its DS are updated as in Case 1. Second, the children of both dendrograms are partitioned into three categories:

1. The D′1 category that contains all the children of D1 that do not overlap with $(\vec{c}_2, R_{c_2})$.
2. The D′2 category that contains all the children of D2 that do not overlap with $(\vec{c}_1, R_{c_1})$.
3. The D12 category that includes the children that overlap with both $(\vec{c}_1, R_{c_1})$ and $(\vec{c}_2, R_{c_2})$.

Next, all children that are not in the D′1 category are removed from D1. The DS about D1 are updated to reflect these changes. If the modified node D1 has more than one child, then its descriptive statistics are obtained by repeatedly applying Theorem 2.1 to two children at a time. Otherwise, the D1 node is replaced by its only child. Similar steps are done for D2. Finally, the build-global-dendrogram( ) process is called using the children in the D12 category. The node that is returned by this method becomes a new child of Dnew.

Figure 4 describes an algorithm that merges two clustering hierarchies.

Note that definitions of several methods in figure 4, such as create-parent( ), add-child( ), find best match( ), delete-children( ), and update-DS( ), are omitted in the paper due to the lack of space. The pseudo code of these methods can be found at http://www.csm.ornl.gov/∼ost/RACHET.
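The following self-contained Python sketch mirrors the case analysis only in broad strokes: it replaces the DS-based approximate distance with exact centroid distances, picks the "best match" by smallest centroid distance rather than largest intersection volume, and collapses Case 4's reconfiguration into simply creating a common parent. All class and function names are ours, so this is a schematic of the control flow in figure 4, not the paper's algorithm.

import math

class Node:
    def __init__(self, centroid, radius, children=None):
        self.centroid, self.radius = centroid, radius
        self.children = children if children is not None else []

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a.centroid, b.centroid)))

def make_parent(d1, d2):
    # Simplified parent summary (an enclosing sphere); RACHET instead updates
    # the parent's DS exactly via Theorem 2.1.
    c = [(x + y) / 2.0 for x, y in zip(d1.centroid, d2.centroid)]
    r = dist(d1, d2) / 2.0 + max(d1.radius, d2.radius)
    return Node(c, r, [d1, d2])

def merge_dendrograms(d1, d2):
    d12 = dist(d1, d2)
    if d12 >= d1.radius + d2.radius:                 # Case 1: well separated
        return make_parent(d1, d2)
    if d12 < d2.radius:                              # Cases 2a/2b: d1 inside d2's hypersphere
        overlapping = [c for c in d2.children if dist(d1, c) < d1.radius + c.radius]
        if not overlapping:                          # Case 2a: d1 becomes a new child of d2
            d2.children.append(d1)
            return d2
        best = min(overlapping, key=lambda c: dist(d1, c))        # crude "best match"
        d2.children[d2.children.index(best)] = merge_dendrograms(d1, best)
        return d2
    if d12 < d1.radius:                              # Case 3: symmetric to Case 2
        return merge_dendrograms(d2, d1)
    return make_parent(d1, d2)                       # Case 4 (reconfiguration omitted here)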

3. Complexity analysis

This section presents complexity analysis for Phase 2 and Phase 3 (figure 1) of the RACHET algorithm. The overall cost of transmitting the local dendrograms to the merger site (Phase 2) is given by:

$$Transmission_{total} = \sum_{i=1}^{|S|} Transmission(i)$$

where Transmission(i) is the cost of transmission of a given local dendrogram i. Given the nature of the dendrogram, there is a total of $2n_i - 1$ nodes in the dendrogram with $n_i$ leaf nodes. We use an array-based format for the dendrogram representation as described in [10]. In this format, there are $2n_i - 1$ elements in the array. Each element in the array contains at most 4 items to represent each node, with an additional 6 items to represent the descriptive statistics about the cluster centroid associated with each node. Thus, the cost of transmission of a given local dendrogram i can be written as:

$$Transmission(i) = O((4 + 6) \times (2n_i - 1)) = O(n_i)$$


Figure 4. An algorithm to efficiently merge two dendrograms.

Therefore, the total cost of transmission for Phase 2 is given by:

$$Transmission_{total} = O(n) \qquad (20)$$

In order to evaluate time and space complexity of generating the global dendrogram (Phase 3), first we analyze the time and space complexity of the merge-dendrograms( ) method (figure 4), the key component of Phase 3. The merge-dendrograms( ) algorithm uses a recursive top-down strategy (with no backtracking) to efficiently merge two dendrograms. The recursion stops when Case 1, Case 2a, or Case 3a ("stopping" cases) happens or a leaf node is reached. Otherwise, the merge-dendrograms( ) or the build-global-dendrogram( ) process proceeds recursively. In constant time we can decide which of the six cases for merging the two dendrograms occurs. Assuming that the branching factor B of a dendrogram


is a constant, we can perform basic operations on dendrograms (adding and deleting a child, updating descriptive statistics, finding best match, etc.) required by each case in constant time as well. Every time a non-stopping case is selected, we descend a level in the dendrogram. If the average depth of a dendrogram is $O(\log_B n)$, merging two dendrograms requires on average $O(\log_B n)$ time. However, for very unbalanced dendrograms, the worst case for merging two dendrograms requires O(n) time. This is an upper bound. As for the space cost of the merge-dendrograms( ) method, we need O(n) space to store both dendrograms and $O(\log_B n)$ average stack space to process the recursion. Thus, the merge-dendrograms( ) method requires O(n) space.

To compute the overall time and space costs of Phase 3 described by the build-global-dendrogram( ) algorithm (figure 2), we first note that this algorithm is an adaptation of the hierarchical clustering algorithm [15]. There are two main parts in this algorithm: the initialization part represented by lines (1) through (4) and the agglomeration part represented by lines (6) through (11). Computing the array storing each $d_{approx}(\vec{c}_i, \vec{c}_j)$ (line (1)) requires $O(|S|^2)$ space and time. Given this array, the initialization of arrays storing each NN(i) and DISS(i) adds a factor of O(|S|) and $O(|S|^2)$ to the overall space and time complexity, respectively. Thus, the time and space complexities of the initialization part of the algorithm are given by:

$$Time_{init} = O(|S|^2) \qquad (21)$$

$$Space_{init} = O(|S|^2) \qquad (22)$$

The agglomeration part that starts on line (6) repeats |S| − 1 times. Determining the two closest dendrograms (lines (7) through (9)) can be performed in O(|S|) time by examining each dendrogram's best match. Based on the complexity analysis of the merge-dendrograms( ) algorithm, the agglomeration in line (10) requires O(n) time and O(n) space. Finally, the updating step (line (11)) can be performed in O(|S|) time for metrics that satisfy the reducibility property [14]. Otherwise, the algorithm requires at most $O(|S|^2)$ per iteration to update the arrays. Thus, the time complexity of the agglomeration part of the algorithm is given by:

$$Time_{agglom} = O((|S| - 1) \cdot |S|) + O((|S| - 1) \cdot n) + O((|S| - 1) \cdot |S|^2) = O(|S|^2) + O(|S| n) \qquad (23)$$

The space complexity of the agglomeration part of the algorithm is given by:

$$Space_{agglom} = O(|S|^2) + O(n) \qquad (24)$$

Hence, the overall time and space complexity for hierarchical clustering of local dendrograms presented in figure 2 is given by:

$$Time_{total} = Time_{init} + Time_{agglom}$$

$$Space_{total} = Space_{init} + Space_{agglom}$$


Using the time and space costs as given in Eqs. (21) through (24), the total time and space cost for Phase 3 is given by:

$$Time_{total} = O(|S|^2) + O(|S| n)$$

$$Space_{total} = O(|S|^2) + O(n)$$

These are effectively O(n) when |S| is constant and |S| ≪ n.

4. Empirical evaluation

In this section, we evaluate the effectiveness of RACHET on several datasets. Tests are done on synthetic datasets and also on "real world" datasets from the ML Repository at UC Irvine, available at http://www.ics.uci.edu/AI/ML/MLDBRepository.html. We use Ward's agglomerative hierarchical clustering algorithm [1] for generating local dendrograms in all of our experiments.

4.1. Experimental methodology

In order to evaluate the effectiveness of the RACHET algorithm relative to the centralized Ward's clustering algorithm, we use the method described by Kargupta et al. [11]. The dendrograms generated in the centralized and distributed fashion are "cut" at different levels such that the same number of clusters, l, results from each. Then, an adjacency matrix is constructed as follows. The element $a_{ij}$ of the adjacency matrix is one if the i-th and j-th data points belong to the same cluster. Otherwise it is zero. The error E(l) of misclassifications comparing the centralized and distributed algorithms is measured as the ratio of the sum of absolute differences between elements of adjacency matrices to the total number of elements. More formally, E(l) is defined as:

$$E(l) = \frac{\sum_{j=1}^{n} \sum_{i=1}^{n} |c_{ij} - d_{ij}|}{n^2}, \qquad (25)$$

where $c_{ij}$ and $d_{ij}$ are elements of the adjacency matrix for the centralized and distributed algorithm, respectively.
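For concreteness, the measure can be computed directly from the flat cluster labels obtained by cutting each dendrogram into l clusters; the Python sketch below is our own illustration of Eq. (25).

import numpy as np

def misclassification_error(labels_centralized, labels_distributed):
    # Both inputs are length-n arrays of cluster labels over the same n data points.
    a = np.asarray(labels_centralized)
    b = np.asarray(labels_distributed)
    c = (a[:, None] == a[None, :]).astype(int)   # adjacency matrix, centralized clustering
    d = (b[:, None] == b[None, :]).astype(int)   # adjacency matrix, distributed clustering
    n = a.shape[0]
    return float(np.abs(c - d).sum()) / (n * n)  # Eq. (25)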

4.2. Results for synthetic data sets

The main purpose of this section is to study the sensitivity of RACHET to various characteristics of the input. The characteristics include various partitions of data points across data sites, the number of data sites, and different dimensionality of data. We first introduce the synthetic data sets.

Synthetic data was created for dimensionality d = 2, 4, 8, and 16. For a given value of d, data was sampled from four Gaussian distributions (hence number of clusters K = 4).


The number of points in each Gaussian is n/K and its mean vector is sampled from a uniform distribution on [min val + k · max val, max val + k · max val] for k = 0, . . . , K − 1. The values for min val and max val are 5 and 15, respectively. Elements of the diagonal covariance matrix are sampled from a uniform distribution on [0, min val/6]. Hence, these are fairly well-separated Gaussians, an ideal situation for a centralized Ward's clustering algorithm. Note that E(l) for l = K in (25) is calculated with the correct classification of points by the centralized algorithm. In many real world data sets, the behavior of the centralized algorithm is not "best-case" for a given K; hence the misclassification error of the distributed algorithm relative to the centralized algorithm is not necessarily an appropriate measure (as demonstrated in Table 4). In this case, a comparison with a known classifier might be preferred.
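A generator along these lines can be sketched as follows in Python. It is our reconstruction of the description above, not the authors' generator; in particular, we read the uniform samples on [0, min val/6] as the diagonal variances.

import numpy as np

def make_synthetic(n=1024, d=2, K=4, min_val=5.0, max_val=15.0, seed=None):
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k in range(K):
        # Mean vector sampled uniformly from the k-th shifted interval.
        mean = rng.uniform(min_val + k * max_val, max_val + k * max_val, size=d)
        # Diagonal covariance entries sampled uniformly from [0, min_val / 6].
        variances = rng.uniform(0.0, min_val / 6.0, size=d)
        pts = rng.normal(loc=mean, scale=np.sqrt(variances), size=(n // K, d))
        points.append(pts)
        labels.append(np.full(n // K, k))
    return np.vstack(points), np.concatenate(labels)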

We test the performance of the algorithm with different numbers of data sites and various distributions of data points across data sites. Note that the "best-case" scenario for RACHET is when all points of the same cluster are assigned to be in a single data site (homogeneous assignment). The "worst case" scenario most likely occurs when each data site contains a subset of points from each of the K clusters (heterogeneous assignment). Table 1 shows the percentage of misclassifications of our algorithm relative to the centralized algorithm for 2, 4, and 6 data sites with a heterogeneous assignment of data points. More precisely, $n_i$ (i = 0, . . . , K − 1) points are randomly selected from each Gaussian and assigned so that each data site has points from each of the Gaussians. For our experiments, $n_i$ is chosen as (n/K)/|S| so that each data site has roughly the same number of points from each Gaussian. The total number of data points n = 1024. The dendrograms were split at level l = 2, 4, 6, 8, and 10. Since we use random sampling in creating synthetic data sets and assigning points to data sites, here we present the average of twenty-five different runs of the distributed algorithm. We generate a synthetic data set and prepare five different random assignments of data points to sites. We take the average of the five resulting adjacency matrices. Each average value is rounded off to the nearest Boolean value. Then we average the obtained misclassification errors across five different synthetic data sets.

We observe from Table 1 that RACHET achieves good performance at higher division levels of the dendrogram. Its behavior on the synthetic data remains

Table 1. Percentage of misclassifications at different levels of division of the global dendrogram generated from |S| = 2, 4, and 6 sites compared to the centrally generated dendrogram.

Division        d = 2                    d = 4                    d = 8                    d = 16
level     |S|=2   |S|=4   |S|=6    |S|=2   |S|=4   |S|=6    |S|=2   |S|=4   |S|=6    |S|=2   |S|=4   |S|=6
2          46%     32%     36%      22%     36%     36%      46%     41%     39%      28%     20%     28%
4           9%     24%     27%      19%      9%     22%      14%     12%     12%       9%     13%     10%
6           9%     12%      8%       9%     12%     10%      12%     13%     12%       5%     12%     12%
8          14%     10%     10%      10%     14%     13%      11%     15%     14%       8%     14%     16%
10         17%     11%     12%      11%     12%     13%      13%     13%     14%       8%     16%     17%

Results are for synthetic data sets of size n = 1024 and dimension d = 2, 4, 8, and 16 with heterogeneous assignment of points to data sites.


roughly unchanged with the number of dimensions and the number of data sites. RACHET shows the worst performance for division level 2, which seems quite natural for the "worst-case" scenario of heterogeneous assignment of data items to data sites. We have not observed the degradation of performance at this division level for the homogeneous assignment of data points.

4.3. Results on real-world data sets

We have tested the algorithm on 3 publicly available "real-world" datasets obtained from the UCI ML Repository: the Boston Housing data, the E.coli data, and the Pima Indians Diabetes data. Table 2 provides a brief summary of these data sets (see Appendix A for more detailed dataset statistics by features).

The E.coli dataset contains 336 data points in 7 dimensions that are classified into 8 clusters. Figure 5 shows the density plot constructed based on the adjacency matrix. The matrix is obtained by the centralized algorithm with the division level set to 8. Figure 6 shows the density plot obtained by our algorithm for the same division level with two data sites. Figure 7 shows their difference. The error of misclassification relative to the centralized algorithm is 9%.

Table 3 summarizes the comparative results at different levels of division of the dendrogram between the centrally generated dendrogram and the global dendrogram generated from two and four data sites for all three data sets. For each data set, experiments are run

Table 2. Brief summary of the three data sets from the UCI ML Repository.

Data set                  No. of items n    No. of features d    No. of classes
E.coli                    336               7                    8
Boston Housing            506               14                   N/A
Pima Indians Diabetes     768               8                    2

Table 3. Percentage of misclassifications at different levels of division of the global dendrogram generated from |S| = 2 and 4 sites compared to the centrally generated dendrogram.

Division     Boston Housing        E.coli                Pima Indians Diabetes
level        |S|=2     |S|=4       |S|=2     |S|=4       |S|=2     |S|=4
2            49%       34%         32%       45%         49%       47%
4            29%       34%          8%       32%         36%       43%
6            26%       24%          9%       29%         31%       36%
8            20%       18%          9%       21%         17%       37%
10           18%       17%         10%       22%         16%       29%
12           14%       13%         10%       22%         15%       26%

Results are for real data with random assignment of points to data sites.


Figure 5. Density plot at division level 8 for centralized clustering of the E.coli data.

Figure 6. Density plot at division level 8 for distributed clustering of the E.coli data with two data sites.

five times with different random assignment of data points to data sites and the results are averaged over these runs. No class labels have been used. Note that the performance does not change with the number of dimensions. It improves at higher division levels, as opposed to the results on the Boston Housing data provided in [10] for vertical (by features) distribution of data items across data sites.


Figure 7. Difference between adjacency matrices obtained for centralized and distributed algorithm at division level 8 for the E.coli data with two data sites.

Comparing the results for real (Table 3) and synthetic (Table 1) data sets, we see that the accuracy of RACHET is slightly worse for real data sets. There are several reasons for the performance deterioration. First, the real data sets are more heterogeneous in terms of the units of measurements and the range of values for the parameters of descriptive statistics. This is illustrated by the summary statistics in Appendix A. Second, the error of misclassification is evaluated relative to the centralized algorithm (see (25)). It may not necessarily be a good measure of RACHET's performance since the behavior of the centralized algorithm may be poor for some real world data sets. For example, Table 4 demonstrates that the centralized algorithm performs worse than the distributed algorithm when compared with the clustering results for known class labels.

We also test the scalability of RACHET with the number of data sites. While the performance for the Boston Housing data (see Table 3), like the performance for synthetic data sets, remains roughly unchanged as the number of data sites increases, the performance

Table 4. Comparative results between clusters identified by class labels and clusters obtained by the centralized algorithm at the division level of the dendrogram set to the number of classes, and the global dendrogram generated from |S| = 2 and 4 sites at the same division level.

                                         Clusters with known class labels compared to:
Data set                  # Classes      Centralized     |S| = 2     |S| = 4
E.coli                    8              22%             17%         20%
Pima Indians Diabetes     2              50%             39%         49%


for the E.coli and Pima Indians Diabetes data sets degrades. We have not identified the reasons for the performance degradation of the latter data sets; however, we expect better scalability with the number of data sites by enriching the descriptive statistics so that a desirable trade-off between transmission cost and accuracy is achieved. Currently, we are investigating approaches in this direction such as using Principal Component Analysis [8], providing low-degree polynomial approximations to feature vectors, as well as explicitly adding coordinates of cluster centroids corresponding to some division level k, k ≪ n, of each local dendrogram.

5. Discussion

5.1. Error analysis

In this section, we present a brief discussion of the error in our Euclidean distance approximation (19). Let $\varepsilon(\vec{c}_1, \vec{c}_2)$ denote the absolute error of the approximation for the Euclidean distance between two centroids $\vec{c}_1$ and $\vec{c}_2$, defined as:

$$\varepsilon(\vec{c}_1, \vec{c}_2) = |d(\vec{c}_1, \vec{c}_2) - d_{approx}(\vec{c}_1, \vec{c}_2)| \qquad (26)$$

First, we make several observations about the behavior of $\varepsilon(\vec{c}_1, \vec{c}_2)$.

Observation 1. If $MIN_{c_1} = MAX_{c_1}$ then $\varepsilon(\vec{c}_1, \vec{c}_2) = 0$ for any centroid $\vec{c}_2$.

In other words, if one of the centroids lies on the bisecting line (i.e., all the vector coordinates are the same), then the approximated distance equals the exact distance.

Observation 2. If $MIN_{c_1} \neq MAX_{c_1}$ and $MIN_{c_2} \neq MAX_{c_2}$ then

$$\max_{\vec{c}_2} \varepsilon(\vec{c}_1, \vec{c}_2) = \varepsilon(\vec{c}_1, \vec{c}_1) \quad \text{and} \quad \max_{\vec{c}_1} \varepsilon(\vec{c}_1, \vec{c}_2) = \varepsilon(\vec{c}_2, \vec{c}_2).$$

In other words, the absolute error achieves its maximum value when the centroids are very close to each other, provided neither of them lies on the bisecting line.

Observation 3. Let $\vec{c}$ be any data point on the bisecting line (e.g., the unit vector). By Observation 1, the descriptive statistics of $\vec{c}_1$ and $\vec{c}_2$ are sufficient for the exact calculation of the lower and upper bounds for $d_{approx}(\vec{c}_1, \vec{c}_2)$ defined as:

$$|d(\vec{c}_1, \vec{c}) - d(\vec{c}_2, \vec{c})| \leq d_{approx}(\vec{c}_1, \vec{c}_2) \leq d(\vec{c}_1, \vec{c}) + d(\vec{c}_2, \vec{c})$$

Hence,

$$\varepsilon(\vec{c}_1, \vec{c}_2) \leq 2 \cdot \min\{ d(\vec{c}_1, \vec{c}),\, d(\vec{c}_2, \vec{c}) \} \qquad (27)$$

The proofs of these observations follow from elementary algebra and are omitted here.


Figure 8. Absolute error plot for the first centroid set to (50, 55) and the second centroid set to all integers between 0 and 100.

Figure 8 illustrates the behavior of the absolute error in 2D when the coordinates of $\vec{c}_1$ are set to (50, 55) and the coordinates of $\vec{c}_2$ are set to all possible integers between 0 and 100. We can see that the absolute error increases when $\vec{c}_2$ approaches $\vec{c}_1$ and equals zero when $\vec{c}_2$ reaches the bisecting line (x = y). The maximum value of the absolute error depends on the location of $\vec{c}_1$, as can be seen in figure 9 when $\vec{c}_1 = (99, 89)$, and is bounded by its distance to the bisecting line.

Figure 9. Absolute error plot for the first centroid set to (99, 89) and the second centroid set to all integers between 0 and 100.


In spite of the fact that the absolute error (see (27)) of the Euclidean distance approximation can be large depending on the location of the centroids, the overall performance of RACHET is still reasonably good, as demonstrated on both synthetic and real datasets in Section 4. There is a heuristic explanation for this behavior. Note that if two centroids are very close to each other, then the merge dendrograms( ) subroutine (see figure 4) will misclassify them as belonging to Case 1, i.e. well separated clusters, as opposed to one of the overlapping cases (Case 2, 3, or 4). In this situation, the recursion stops instead of refining the process of merging local dendrograms. However, this situation is more likely to occur closer to the leaves of local dendrograms rather than the roots. Many of our experiments on synthetic and real data sets support this statement, with a few exceptions for some specifically designed assignments of data points to data sites. Hence, the performance of RACHET will degenerate more at division levels close to the leaves of the global dendrogram and will remain acceptable at moderate division levels.

5.2. Future work

Empirical results on synthetic Gaussian data indicate that RACHET provides a comparable quality solution to the distributed hierarchical clustering problem while being scalable with both the number of dimensions and the number of data sites. Results on the small real-world UCI ML data sets indicate that RACHET can provide a more effective clustering solution than the solution generated by the centralized clustering. The reason for using small real data sets is that the goal at this stage is to demonstrate the ability to create a comparable quality global dendrogram from distributed local dendrograms within reasonable requirements for the time, space, and communication cost. However, based on the theoretical results for linear time/space/communication complexity of RACHET, the next step is to study the efficiency of RACHET in dealing with very large (gigabytes or terabytes) and very high-dimensional (thousands of features) real data sets. Examples of such massive datasets might be the Reuters text classification database consisting of documents with hundreds of thousands of words (i.e., hundreds of thousands of dimensions) or the PCMDI archive of climate simulation model outputs with each output in the order of a couple of terabytes and 2500 or more dimensions. We believe that the RACHET algorithm is scalable to such sizes of the problem because it transforms a large problem into a set of small subproblems with cumulative computational cost much less than the aggregate problem.

The distributed hierarchical clustering algorithm proposed here is in the context of centroid-based hierarchical algorithms using Euclidean distance as a dissimilarity measure between two data objects. We note that similar ideas can be extended to other hierarchical clustering algorithms as well as to non-Euclidean dissimilarity measures.

6. Summary

This paper presents RACHET, a hierarchical clustering method for very large, high-dimensional, horizontally distributed datasets. Most hierarchical clustering algorithms suffer from


severe drawbacks when applied to very massive and distributed datasets: 1) they require prohibitively high communication cost to centralize the data to a single site and 2) they do not scale up with the number of data items and with the dimensionality of data sets. RACHET makes the scalability problem more tractable. This is achieved by generating local clustering hierarchies on smaller data subsets and using condensed cluster summaries for the consecutive agglomeration of these hierarchies while maintaining clustering quality. Moreover, RACHET has significantly lower (linear) communication costs than traditional centralized approaches.

Appendix A: Summary of dataset statistics

Feature number    Minimal feature value    Maximal feature value    Mean feature value    Standard deviation

E.coli

1 0.00 0.89 0.50 0.19

2 0.16 1.00 0.50 0.15

3 0.48 1.00 0.50 0.09

4 0.50 1.00 0.50 0.03

5 0.00 0.88 0.50 0.12

6 0.03 1.00 0.50 0.22

7 0.00 0.99 0.50 0.21

Boston Housing

1 0.01 88.98 3.61 8.60

2 0.00 100.00 11.36 23.32

3 0.46 27.74 11.14 6.86

4 0.00 1.00 0.07 0.25

5 0.39 0.87 0.55 0.12

6 3.56 8.78 6.28 0.70

7 2.90 100.98 68.57 28.15

8 1.13 12.13 3.80 2.11

9 1.00 24.00 9.55 8.71

10 187.00 711.00 408.24 168.54

11 12.60 22.00 18.46 2.16

12 0.32 396.90 356.67 91.29

13 1.73 37.97 12.65 7.14

14 5.00 50.00 22.53 9.20

Pima Indians Diabetes

1 0.00 17.00 3.8 3.4

2 0.00 199.00 120.9 32


3 0.00 122.00 69.1 19.4

4 0.00 99.00 20.5 16

5 0.00 846.00 79.8 115.2

6 0.00 67.10 32 7.9

7 0.08 2.42 0.5 0.3

8 21.00 81.00 33.2 11.8

Acknowledgment

This research was supported in part by an appointment to the Oak Ridge National Laboratory Postdoctoral Research Associates Program administered jointly by the Oak Ridge Association of Universities and Oak Ridge National Laboratory.

References

1. M.R. Anderberg, Cluster Analysis and Applications, Academic Press: New York, 1973.
2. R. Brachman, T. Khabaza, W. Kloesgen, G. Piatetsky-Shapiro, and E. Simoudis, "Mining business databases," Communications of the ACM, vol. 39, no. 11, pp. 42–48, 1996.
3. W.H.E. Day and H. Edelsbrunner, "Efficient algorithms for agglomerative hierarchical clustering methods," Journal of Classification, vol. 1, pp. 7–24, 1984.
4. I. Dhillon and D. Modha, "A data clustering algorithm on distributed memory multiprocessors," in Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, Mohammed Javeed Zaki and Ching-Tien Ho (Eds.), SIGKDD, Aug. 15, 1999, San Diego, CA, USA, pp. 245–260.
5. R. Dubes and A. Jain, "Clustering methodologies in exploratory data analysis," Advances in Computers, vol. 19, pp. 113–228, 1980.
6. U. Fayyad, D. Haussler, P. Stolorz, and R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, MIT Press: Cambridge, MA, 1996.
7. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press: New York, 1990.
8. J.E. Jackson, A User's Guide to Principal Components, John Wiley & Sons: New York, 1991.
9. A.K. Jain, M.N. Murty, and P.J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, pp. 264–323, 1999.
10. E. Johnson and H. Kargupta, "Collective, hierarchical clustering from distributed, heterogeneous data," Lecture Notes in Computer Science, vol. 1759, Springer-Verlag: Berlin, 1999.
11. H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson, "Distributed clustering using collective principal component analysis," Knowledge and Information Systems, vol. 3, no. 4, pp. 422–448, 2001.
12. L. Kaufman and P. Rousseeuw, Finding Groups in Data, John Wiley and Sons: New York, 1989.
13. G.N. Lance and W.T. Williams, "A general theory of classificatory sorting strategies. 1: Hierarchical systems," Computer Journal, vol. 9, pp. 373–380, 1967.
14. F. Murtagh, "A survey of recent advances in hierarchical clustering algorithms," Computer Journal, vol. 26, pp. 354–359, 1983.
15. C. Olson, "Parallel algorithms for hierarchical clustering," Parallel Computing, vol. 8, pp. 1313–1325, 1995.


Distributed and Parallel Databases, 11, 181–201, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Parallelizing the Data Cube∗

FRANK DEHNE    [email protected], www.dehne.net
School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6

TODD EAVIS    [email protected]
Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5

SUSANNE HAMBRUSCH    [email protected], www.cs.purdue.edu/people/seh
Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA

ANDREW RAU-CHAPLIN    [email protected], www.cs.dal.ca/∼arc
Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5

Recommended by: Mohammed J. Zaki

Abstract. This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce inter processor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting.

The bottom-up partitioning strategy balances the number of single attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array.

We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.

Keywords: OLAP, data cube, parallel processing, partitioning, load balancing

∗A preliminary version of this paper has been published in the proceedings of the 8th International Conference on Database Theory (ICDT 2001), London, UK, January 2001. Research partially supported by the Natural Sciences and Engineering Research Council of Canada and the National Science Foundation (Grant 9988339-CCR).


1. Introduction

Data cube queries represent an important class of On-Line Analytical Processing (OLAP) queries in decision support systems. The precomputation of the different group-bys of a data cube (i.e., the forming of aggregates for every combination of GROUP BY attributes) is critical to improving the response time of the queries [16]. Numerous solutions for generating the data cube have been proposed. One of the main differences between the many solutions is whether they are aimed at sparse or dense relations [4, 17, 24, 25, 32]. Solutions within a category can also differ considerably. For example, top-down data cube computations for dense relations based on sorting have different characteristics from those based on hashing.

To meet the need for improved performance and to effectively handle the increase in data sizes, parallel solutions for generating the data cube are needed. In this paper we present a general framework for the efficient parallelization of existing data cube construction algorithms. We present load balanced and communication efficient partitioning strategies which generate a subcube computation for every processor. Subcube computations are then carried out using existing sequential, external memory data cube algorithms. Balancing the load assigned to different processors and minimizing the communication overhead are the core problems in achieving high performance on parallel systems. As discussed in [20, 22], this is a challenging problem.

At the heart of this paper are two partitioning strategies, one for top-down and one for bottom-up data cube construction algorithms. Good load balancing approaches generally make use of application specific characteristics. Our partitioning strategies assign loads to processors by using metrics known to be crucial to the performance of data cube algorithms [1, 4, 25]. The bottom-up partitioning strategy balances the number of single attribute external sorts made by each processor [4]. The top-down strategy partitions a weighted tree in which weights reflect algorithm specific cost measures such as estimated group-by sizes [1, 25].

The main advantages of our partitioning strategies for parallel data cube construction are:

– Experimental data indicate that our top-down partitioning method produces very close to optimal load balancing.

– Our bottom-up partitioning method produces tasks which require the same number of single attribute sorts on each processor (the main cost for bottom-up data cube construction).

– Our methods reduce inter-processor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel (as proposed in [13, 14]).

– Our methods create a small number of coarse tasks. Only a very small number of tasks is assigned to each processor. This allows for sharing of prefixes and sort orders between different group-by computations. A large number of tasks creates the problem that such sharing can not be exploited, resulting in loss of performance as reported in [22].


– Our methods maximize code reuse from existing sequential data cube implementations by using existing sequential data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to the parallel setting.

Our partitioning approaches are designed for standard, shared disk type, parallel machines: p processors connected via an interconnection fabric where the processors have standard-size local memories and access to a shared disk array. We have implemented and tested our parallel top-down data cube construction method. We implemented sequential pipesort [1] in C++, and our parallel top-down data cube construction method (Section 4) in C++ with MPI [2]. We tested our code on an eight processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.

The paper is organized as follows. Section 2 introduces the datacube problem and describes the parallel machine model underlying our partitioning approaches as well as the input and the output configuration for our algorithms. Section 3 presents our partitioning approach for parallel bottom-up data cube generation and Section 4 outlines our method for parallel top-down data cube generation. In Section 5 we indicate how our top-down cube parallelization can be easily modified to obtain an efficient parallelization of the ArrayCube method [32]. Section 6 presents the performance analysis of our parallel top-down partitioning approach. Section 7 compares our methods to previously published results and Section 8 concludes the paper and discusses possible extensions of our methods.

2. Preliminaries

The group-by operator in SQL computes aggregates on a set of attributes. To make interactive analysis possible, OLAP databases often precompute aggregates. For a given relation R, the datacube operator refers to computing the aggregates for every combination of attributes of R. For example, given a relation R with attributes A, B, C, the datacube operator will result in the computation of 2^3 = 8 group-bys: ABC, AB, BC, AC, A, B, C, all, where all denotes the empty group-by. Precomputing the aggregates improves the response time of aggregation queries and numerous solutions for the computation of all aggregates have been proposed [1, 4, 10, 16, 24, 25]. Since the number of group-bys grows exponentially in the number of dimensions (for d dimensions, 2^d group-bys are computed), algorithms make use of various problem- and data-dependent characteristics to improve efficiency.

Let R be the input data set representing d-dimensional data with |R| = N. We use A1, A2, . . . , Ad to denote the d attributes of relation R. Underlying all cube algorithms is the lattice representing the 2^d group-bys and their parent-child relationship. Figure 1 shows this lattice for d = 4, where A, B, C, and D represent the attributes of the four dimensions. The nodes of the lattice represent the group-bys and the edges indicate the parent-child relationship. Label AB of a node represents the group-by in which each entry


Figure 1. A 4-dimensional lattice.

is aggregated over all distinct combinations over AB. A group-by is a child of some parent group-by if the child can be computed from the parent by aggregating some of its attributes. Parent-child relationships allow algorithms to share partitions, sorts, and partial sorts between different group-bys. For example, if the data has been sorted with respect to AB, then the cuboid group-by A can be generated from AB without sorting, and generating ABC requires only a sorting of blocks of entries. Cube algorithms differ on how they make use of these commonalities. Bottom-up approaches reuse previously computed sort orders and generate more detailed group-bys from less detailed ones (a less detailed group-by contains a subset of the attributes). Top-down approaches use more detailed group-bys to compute less detailed ones. Bottom-up approaches are better suited for sparse relations. Relation R is sparse if N is much smaller than the number of possible values in the given d-dimensional space. We present different partitioning and load balancing approaches depending on whether a top-down or bottom-up sequential cube algorithm is used.
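The lattice itself is easy to enumerate programmatically. The short Python sketch below (ours, for illustration only) lists the 2^d group-bys and the parent-child edges; for d = 4 it reproduces the 16 nodes of figure 1, with the empty tuple playing the role of the 'all' group-by.

from itertools import combinations

def group_by_lattice(attributes):
    # attributes: e.g. ['A', 'B', 'C', 'D'].
    group_bys = [c for r in range(len(attributes), -1, -1)
                 for c in combinations(attributes, r)]
    # A child is obtained from its parent by aggregating away exactly one attribute.
    edges = [(parent, child)
             for parent in group_bys for child in group_bys
             if len(child) == len(parent) - 1 and set(child) <= set(parent)]
    return group_bys, edges

group_bys, edges = group_by_lattice(['A', 'B', 'C', 'D'])
assert len(group_bys) == 16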

We conclude this section with a brief discussion of the underlying parallel model, the standard shared disk parallel machine model. That is, we assume p processors connected via an interconnection fabric where processors have typical workstation size local memories and concurrent access to a shared disk array. For the purpose of parallel algorithm design, we use the Coarse Grained Multicomputer (CGM) model [5, 8, 15, 18, 27]. More precisely, we use the EM-CGM model [6, 7, 9] which is a multi-processor version of Vitter's Parallel Disk Model [28–30]. For our parallel data cube construction methods we assume that the d-dimensional input data set R of size N is stored on the shared disk array. The output, i.e. the group-bys comprising the data cube, will be written to the shared disk array. Subsequent applications may impose requirements on the output. For example, a visualization application may require storing each group-by in striped format over the entire disk array to support fast access to individual group-bys.


3. Parallel bottom-up data cube construction

Bottom-up data cube construction methods calculate the group-bys in an order which emphasizes the reuse of previously computed sorts, and they generate more detailed group-bys from less detailed ones. Bottom-up methods are well suited for sparse relations and they support the selective computation of blocks in a group-by; e.g., generate only blocks which satisfy a user-defined aggregate condition [4].

Previous bottom-up methods include BUC [4] and PartitionCube (part of [24]). The main idea underlying bottom-up methods can be captured as follows: if the data has previously been sorted by attribute A, then creating an AB sort order does not require a complete resorting. A local resorting of A-blocks (blocks of consecutive elements that have the same attribute A) can be used instead. The sorting of such A-blocks can often be performed in local memory. Hence, instead of another external memory sort, the AB order can be created in one single scan through the disk. Bottom-up methods [4, 24] attempt to break the problem into a sequence of single attribute sorts which share prefixes of attributes and can be performed in local memory with a single disk scan. As outlined in [4, 24], the total computation time of these methods is dominated by the number of such single attribute sorts.

In this section we describe a partitioning of the group-by computations into p independent subproblems. The partitioning generates subproblems which can be processed efficiently by bottom-up sequential cube methods. The goal of the partitioning is to balance the number of single attribute sorts required by each subproblem and to ensure that each subproblem has overlapping sort sequences in the same way as for the sequential methods (thereby avoiding additional work).

Let A1, . . . , Ad be the attributes of relation R and assume |A1| ≥ |A2| ≥ · · · ≥ |Ad|, where |Ai| is the number of different possible values for attribute Ai. As observed in [24], the set of all group-bys of the data cube can be partitioned into those that contain A1 and those that do not contain A1. In our partitioning approach, the group-bys containing A1 will be sorted by A1. We indicate this by saying that they contain A1 as a prefix. The group-bys not containing A1 (i.e., A1 is projected out) contain A1 as a postfix. We then recurse with the same scheme on the remaining attributes. We shall utilize this property to partition the computation of all group-bys into independent subproblems. The load between subproblems will be balanced and they will have overlapping sort sequences in the same way as for the sequential methods. In the following we give the details of our partitioning method.

Let x, y, z be sequences of attributes representing sort orders and let A be an arbitrary single attribute. We introduce the following definition of sets of attribute sequences representing sort orders (and their respective group-bys):

S1(x, A, z) = {x, xA} (1)

Si(x, Ay, z) = Si−1(xA, y, z) ∪ Si−1(x, y, Az), 2 ≤ i ≤ d (2)

The entire data cube construction corresponds to the set Sd(∅, A1 . . . Ad, ∅) of sort orders and respective group-bys, where d is the dimension of the data cube. We refer to i as the rank of Si(. . .). The set Sd(∅, A1 . . . Ad, ∅) is the union of two subsets of rank d − 1: Sd−1(A1, A2 . . . Ad, ∅) and Sd−1(∅, A2 . . . Ad, A1).


Figure 2. Partitioning for a 4-dimensional data cube with attributes A, B, C, D. The 8 S1-sets correspond to the 16 group-bys determined for four attributes.

These, in turn, are the union of four subsets of rank d − 2: Sd−2(A1A2, A3 . . . Ad, ∅), Sd−2(A1, A3 . . . Ad, A2), Sd−2(A2, A3 . . . Ad, A1), and Sd−2(∅, A3 . . . Ad, A2A1). A complete example for a 4-dimensional data cube with attributes A, B, C, D is shown in figure 2.

For the sake of simplifying the discussion, we assume that p is a power of 2, p = 2^k. Consider the 2p S-sets of rank d − k − 1. Let β = (B1, B2, . . . , B2p) be these 2p sets in the order defined by Eq. (2). Define

Shuffle(β) = <B1 ∪ B2p, B2 ∪ B2p−1, B3 ∪ B2p−2, . . . , Bp ∪ Bp+1>
           = <Γ1, Γ2, . . . , Γp>

Our partitioning assigns set Γi = Bi ∪ B2p−i+1 to processor Pi, 1 ≤ i ≤ p, as summarized in Algorithm 1.

ALGORITHM 1. Parallel Bottom-Up Cube Construction.
Each processor Pi, 1 ≤ i ≤ p, performs the following steps, independently and in parallel:
(1) Determine the two sets forming Γi as described below.
(2) Compute all group-bys in Γi using a sequential (external-memory) bottom-up cube construction method.
—End of Algorithm—
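The following sketch (an illustration only, with attribute naming and output format chosen for readability) enumerates the 2p prefixes B1, . . . , B2p in the order given by Eq. (2) and pairs them into the Γ-sets of Algorithm 1.

```cpp
// Minimal sketch (illustrative only): enumerate the 2p S-set prefixes of rank
// d-k-1 in the order of Eq. (2) and pair them into the Gamma-sets used by
// Algorithm 1.  Attribute names "A1".."Ad" and the printed format are
// assumptions of this sketch; p is taken to be 2^k.
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const int d = 10, k = 3, p = 1 << k;          // the example used below
    std::vector<std::string> beta;                 // the sequence B1, ..., B2p
    for (int m = 2 * p - 1; m >= 0; --m) {         // Eq. (2): "A in prefix" branch first
        std::string prefix;
        for (int j = 1; j <= k + 1; ++j)
            if (m & (1 << (k + 1 - j))) prefix += "A" + std::to_string(j);
        beta.push_back(prefix.empty() ? "(empty)" : prefix);
    }
    // Gamma_i = B_i union B_{2p-i+1} is assigned to processor P_i; each of the
    // two S-sets expands over the remaining attributes A_{k+2}, ..., A_d.
    for (int i = 1; i <= p; ++i)
        std::printf("P%d: prefixes { %s, %s }, remaining attributes A%d..A%d\n",
                    i, beta[i - 1].c_str(), beta[2 * p - i].c_str(), k + 2, d);
    return 0;
}
```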

We illustrate the partitioning using an example with d = 10 and p = 2^3 = 8. For these values, we generate 16 S-sets of rank 6.


Figure 3. Γ-sets assigned to 8 processors when d = 10; symbol • represents a projected-out attribute and ∗ represents an existing attribute.

Giving only the indices of attributes A1, A2, A3, and A4, we have

β = (1234, 123, 124, 12, 134, 13, 14, 1, 234, 23, 24, 2, 34, 3, 4, ∅)

Each processor is assigned the computation of 2^7 group-bys, as shown in figure 3. If every processor has access to its own copy of relation R, then a processor performs k + 1 single attribute sorts to generate the data in the ordering needed for its group-bys. If there is only one copy of R, read-conflicts can be avoided by sorting the sequences using a binomial heap broadcast pattern [19]. Doing so results in every processor Pi receiving its two sorted sequences forming Γi after the time needed for k + 1 single attribute sorts. Figure 4 shows the sequence of sorts for the 8-processor example. The index inside the circles indicates the processor assignment; i.e., processor 1 performs a total of four single attribute sorts on the original relation R, starting with the sort on attribute A1. Using binomial heap properties, it follows that a processor does at most k + 1 single attribute sorts and the 2p sorted sequences are available after the time needed for k + 1 sorts.

Figure 4. Binomial heap structure for generating the 2p Γ-sets without read conflicts.


Algorithm 1 can easily be generalized to values of p which are not powers of 2. We also note that Algorithm 1 requires p ≤ 2^(d−1). This is usually the case in practice. However, if a parallel algorithm is needed for larger values of p, the partitioning strategy needs to be augmented. Such an augmentation could, for example, be a partitioning strategy based on the number of data items for a particular attribute. This would be applied after partitioning based on the number of attributes has been done. Since the range p ∈ {2^0, . . . , 2^(d−1)} covers current needs with respect to machine and dimension sizes, we do not further discuss such augmentations in this paper.

The following four properties summarize the main features of Algorithm 1 that make it load balanced and communication efficient:

– The computation of each group-by is assigned to a unique processor.
– The calculation of the group-bys in Γi, assigned to processor Pi, requires the same number of single attribute sorts for all 1 ≤ i ≤ p.
– The sorts performed at processor Pi share prefixes of attributes in the same way as in [4, 24] and can be performed with disk scans in the same manner as in [4, 24].
– The algorithm requires no inter-processor communication.

4. Parallel top-down data cube construction

Top-down approaches for computing the data cube, like the sequential PipeSort, PipeHash, and Overlap methods [1, 10, 25], use more detailed group-bys to compute less detailed ones that contain a subset of the attributes of the former. They apply to data sets where the number of data items in a group-by can shrink considerably as the number of attributes decreases (data reduction). The PipeSort, PipeHash, and Overlap methods select a spanning tree T of the lattice, rooted at the group-by containing all attributes. PipeSort considers two cases of parent-child relationships. If the ordered attributes of the child are a prefix of the ordered attributes of the parent (e.g., ABCD → ABC), then a simple scan is sufficient to create the child from the parent. Otherwise, a sort is required to create the child. PipeSort seeks to minimize the total computation cost by computing minimum cost matchings between successive layers of the lattice. PipeHash uses hash tables instead of sorting. Overlap attempts to reduce sort time by utilizing the fact that overlapping sort orders do not always require a complete new sort. For example, the ABC group-by has A-partitions that can be sorted independently on C to produce the AC sort order. This may permit independent sorts in memory rather than always using an external memory sort.
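The scan-versus-sort distinction that drives these methods can be stated compactly; the following sketch is an illustration only, with attribute orders represented as vectors of attribute names.

```cpp
// Minimal sketch (illustrative only) of the scan-versus-sort test used by
// PipeSort: a child whose attribute order is a prefix of the parent's sort
// order can be produced by a single scan, otherwise a sort is needed.
#include <cstddef>
#include <string>
#include <vector>

enum class EdgeKind { Scan, Sort };

EdgeKind classifyEdge(const std::vector<std::string>& parentOrder,
                      const std::vector<std::string>& childOrder) {
    if (childOrder.size() >= parentOrder.size()) return EdgeKind::Sort;
    for (std::size_t i = 0; i < childOrder.size(); ++i)
        if (childOrder[i] != parentOrder[i]) return EdgeKind::Sort;
    return EdgeKind::Scan;   // e.g. ABCD -> ABC
}
```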

Next, we outline a partitioning approach which generates p independent subproblems, each of which can be solved by one processor using an existing external-memory top-down cube algorithm. The first step of our algorithm determines a spanning tree T of the lattice by using one of the existing approaches, such as PipeSort, PipeHash, or Overlap. To balance the load between the different processors we next perform a storage estimation to determine approximate sizes of the group-bys in T. This can be done, for example, by using methods described in [11, 26]. We now work with a weighted tree. The most crucial part of our solution is the partitioning of the tree. The partitioning of T into subtrees induces a partitioning of the data cube problem into p subproblems (subsets of group-bys).


Determining an optimal partitioning of the weighted tree is easily shown to be an NP-complete problem (by a reduction from processor scheduling, for example). Since the weights of the tree represent estimates, a heuristic approach which generates p subproblems with "some control" over the sizes of the subproblems holds the most promise. While we want the sizes of the p subproblems balanced, we also want to minimize the number of subtrees assigned to a processor. Every subtree may require a scanning of the entire data set R, and thus too many subtrees can result in poor I/O performance. The solution we develop balances these two considerations.

Our heuristic makes use of a related partitioning problem on trees for which efficient algorithms exist, the min-max tree k-partitioning problem [3], defined as follows: Given a tree T with n vertices and a positive weight assigned to each vertex, delete k edges in the tree such that the largest total weight of a resulting subtree is minimized.

The min-max tree k-partitioning problem has been studied in [3, 12, 23]. These methods assume that the weights are fixed. Note that our partitioning problem on T is different in that, as we cut a subtree T′ out of T, an additional cost is introduced because the group-by associated with the root of T′ must now be computed from scratch through a separate sort. Hence, when cutting T′ out of T, the weight of the root of T′ has to be increased accordingly. We have adapted the algorithm in [3] to account for the changes of weights required. This algorithm is based on a pebble shifting scheme where k pebbles are shifted down the tree, from the root towards the leaves, determining the cuts to be made. In our adapted version, as cuts are made, the cost for the parent of the new partition is adjusted to reflect the cost of the additional sort. Its original cost is saved in a hash table for possible future use since cuts can be moved many times before reaching their final position. In the remainder, we shall refer to this method as the modified min-max tree k-partitioning.

However, even a perfect min-max k-partitioning does not necessarily result in a partitioning of T into subtrees of equal size, nor does it address tradeoffs arising from the number of subtrees assigned to a processor. We use tree-partitioning as an initial step for our partitioning. To achieve a better distribution of the load we apply an over-partitioning strategy: instead of partitioning the tree T into p subtrees, we partition it into s × p subtrees, where s is an integer, s ≥ 1. Then, we use a "packing heuristic" to determine which subtrees belong to which processors, assigning s subtrees to every processor. Our packing heuristic considers the weights of the subtrees and pairs subtrees by weights to control the number of subtrees. It consists of s matching phases in which the p largest subtrees (or groups of subtrees) and the p smallest subtrees (or groups of subtrees) are matched up. Details are described in Step 2b of Algorithm 2.

ALGORITHM 2. Sequential Tree-partition(T, s, p).
Input: A spanning tree T of the lattice with positive weights assigned to the nodes (representing the cost to build each node from its ancestor in T). Integer parameters s (oversampling ratio) and p (number of processors).
Output: A partitioning of T into p subsets Γ1, . . . , Γp of s subtrees each.
(1) Compute a modified min-max tree s × p-partitioning of T into s × p subtrees T1, . . . , Ts×p.


(2) Distribute subtrees T1, . . . , Ts×p among the p subsets Γ1, . . . , Γp, s subtrees per subset, as follows:
(2a) Create s × p sets of trees named ϒi, 1 ≤ i ≤ sp, where initially ϒi = {Ti}. The weight of ϒi is defined as the total weight of the trees in ϒi.
(2b) For j = 1 to s − 1:
• Sort the ϒ-sets by weight, in increasing order. W.L.O.G., let ϒ1, . . . , ϒsp−(j−1)p be the resulting sequence.
• Set ϒi := ϒi ∪ ϒsp−(j−1)p−i+1, 1 ≤ i ≤ p.
• Remove ϒsp−(j−1)p−i+1, 1 ≤ i ≤ p.
(2c) Set Γi = ϒi, 1 ≤ i ≤ p.
—End of Algorithm—
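The following sketch illustrates the packing of Step 2 (an illustration only; it tracks aggregated weights rather than the subtrees themselves).

```cpp
// Minimal sketch (illustrative only) of the packing in Step 2 of Algorithm 2:
// the s*p subtree weights are packed into p processor sets by repeatedly
// pairing the p lightest Upsilon-sets with the p heaviest ones.  Only the
// aggregated weights are tracked here, not the subtrees themselves.
#include <algorithm>
#include <cstdio>
#include <vector>

std::vector<double> packSubtrees(std::vector<double> weights, int s, int p) {
    std::vector<double> sets = weights;               // initially Upsilon_i = {T_i}
    for (int j = 1; j <= s - 1; ++j) {
        std::sort(sets.begin(), sets.end());          // increasing weight
        int n = static_cast<int>(sets.size());        // n = sp - (j-1)p
        for (int i = 0; i < p; ++i)
            sets[i] += sets[n - 1 - i];               // merge light set i with heavy set n-i
        sets.resize(n - p);                           // the merged heavy sets are removed
    }
    return sets;                                      // p aggregated weights, one per processor
}

int main() {
    std::vector<double> w = {9, 8, 7, 6, 5, 4, 3, 2}; // s = 2, p = 4 example
    for (double g : packSubtrees(w, 2, 4)) std::printf("%.1f ", g);
    std::printf("\n");                                // prints: 11.0 11.0 11.0 11.0
    return 0;
}
```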

The above tree partition algorithm is embedded into our parallel top-down data cube construction algorithm. Our method provides a framework for parallelizing any sequential top-down data cube algorithm. An outline of our approach is given in the following Algorithm 3.

ALGORITHM 3. Parallel Top-Down Cube Construction.
Each processor Pi, 1 ≤ i ≤ p, performs the following steps independently and in parallel:
(1) Select a sequential top-down cube construction method (e.g., PipeSort, PipeHash, or Overlap) and compute the spanning tree T of the lattice as used by this method.
(2) Apply the storage estimation method in [11, 26] to determine the approximate sizes of all group-bys in T. Compute the weight of each node of T: the estimated cost to build each node from its ancestor in T.
(3) Execute Algorithm Tree-partition(T, s, p) as shown above, creating p sets Γ1, . . . , Γp. Each set Γi contains s subtrees of T.
(4) Compute all group-bys in subset Γi using the sequential top-down cube construction method chosen in Step 1.
—End of Algorithm—
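The following MPI-style skeleton is an illustration of this framework only (the cube-building call is a hypothetical placeholder, and the consecutive subtree indexing stands in for the assignment produced by Tree-partition); it highlights that each processor proceeds without inter-processor communication.

```cpp
// Minimal MPI-style skeleton (illustrative only): every processor runs the
// steps of Algorithm 3 independently and then processes its own s subtrees.
// runSequentialCube() is a hypothetical placeholder for an existing sequential
// top-down method (e.g. PipeSort), and the consecutive subtree indexing below
// stands in for the assignment produced by Tree-partition.
#include <cstdio>
#include <mpi.h>

void runSequentialCube(int subtree) {                 // placeholder only
    std::printf("building all group-bys of subtree %d\n", subtree);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, p = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int s = 2;                                  // oversampling ratio
    // Steps 1-3 (spanning tree, weights, Tree-partition) would run identically
    // on every processor; no communication is needed during cube construction.
    for (int j = 0; j < s; ++j)
        runSequentialCube(rank * s + j);              // this processor's s subtrees

    MPI_Finalize();
    return 0;
}
```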

Our performance results described in Section 6 show that an over-partitioning with s = 2 or 3 achieves very good results with respect to balancing the loads assigned to the processors. This is an important result since a small value of s is crucial for optimizing performance.

5. Parallel array-based data cube construction

Our method in Section 4 can be easily modified to obtain an efficient parallelization of the ArrayCube method presented in [32]. The ArrayCube method is aimed at dense data cubes and structures the raw data set in a d-dimensional array stored on disk as a sequence of "chunks". Chunking is a way to divide the d-dimensional array into small d-dimensional chunks, where each chunk is a portion containing a data set that fits into a disk block.


When a fixed sequence of such chunks is stored on disk, the calculation of each group-by requires a certain amount of buffer space [32]. The ArrayCube method calculates a minimum memory spanning tree of group-bys, MMST, which is a spanning tree of the lattice such that the total amount of buffer space required is minimized. The total number of disk scans required for the computation of all group-bys is the total amount of buffer space required divided by the memory space available. The ArrayCube method can therefore be parallelized by simply applying Algorithm 3 with T being the MMST.

6. Experimental performance analysis

We have implemented and tested our parallel top-down data cube construction method presented in Section 4. We implemented sequential pipesort [1] in C++, and our parallel top-down data cube construction method (Section 4) in C++ with MPI [2]. Most of the required graph algorithms, as well as data structures like hash tables and graph representations, were drawn from the LEDA library [21]. Still, the implementation took one person year of full time work. We chose to implement our parallel top-down data cube construction method rather than our parallel bottom-up data cube construction method because the former has more tunable parameters that we wish to explore. As our primary parallel hardware platform, we use a PC cluster consisting of a front-end machine and eight processors. The front-end machine is used to partition the lattice and distribute the work among the other 8 processors. The front-end machine is an IBM Netfinity server with two 9 GB SCSI disks, 512 MB of RAM and a 550 MHz Pentium processor. The processors are 166 MHz Pentiums with 2 GB IDE hard drives and 32 MB of RAM, except for one processor which is a 133 MHz Pentium. The processors run LINUX and are connected via a 100 Mbit Fast Ethernet switch with full wire speed on all ports. Clearly, this is a very low end, older, hardware platform. The experiments reported in the remainder of this section represent several weeks of 24 hr/day testing, and the PC cluster platform described above has the advantage of being available exclusively for our experiments without any other user disturbing our measurements. For our main goal of studying the speedup obtained by our parallel method rather than absolute times, this platform proved sufficient. To verify that our results also hold for newer machines with faster processors, more memory per processor, and higher bandwidth, we then ported our code to a SunFire 6800 and performed comparison tests on the same data sets. The SunFire 6800 used is a very recent Sun multiprocessor with Sun UltraSPARC III 750 MHz processors running Solaris 8, 24 GB of RAM and a Sun T3 shared disk.

Figure 5 shows the PC cluster running time observed as a function of the number of processors used. For the same data set, we measured the sequential time (sequential pipesort [1]) and the parallel time obtained through our parallel top-down data cube construction method (Section 4), using an oversampling ratio of s = 2. The data set consisted of 1,000,000 records with dimension 7. Our test data values were uniformly distributed over 10 values in each dimension. Figure 5 shows the running times of the algorithm as we increase the number of processors. There are three curves shown. The runtime curve shows the time taken by the slowest processor (i.e. the processor that received the largest workload). The second curve shows the average time taken by the processors. The time taken by the front-end machine, to partition the lattice and distribute the work among the compute nodes, was insignificant.


Figure 5. PC cluster running time in seconds as a function of the number of processors. (Fixed parameters: Data size = 1,000,000 rows. Dimensions = 7. Experiments per data point = 5.)

The theoretical optimum curve shown in figure 5 is the sequential pipesort time divided by the number of processors used.

We observe that the runtime obtained by our code and the theoretical optimum are essentially identical. That is, for an oversampling ratio of s = 2, an optimal speedup of p is observed. (The anomaly in the runtime curve at p = 4 is due to the slower 133 MHz Pentium processor.)

Interestingly, the average time curve is always below the theoretical optimum curve, and even the runtime curve is sometimes below the theoretical optimum curve. One would have expected that the runtime curve would always be above the theoretical optimum curve. We believe that this superlinear speedup is caused by another effect which benefits our parallel method: improved I/O. When sequential pipesort is applied to a 10 dimensional data set, the lattice is partitioned into pipes of length up to 10. In order to process a pipe of length 10, pipesort needs to write to 10 open files at the same time. It appears that under LINUX, the number of open files can have a considerable impact on performance. For 100,000 records, writing them to 4 files each took 8 seconds on our system. Writing them to 6 files each took 23 seconds, not 12, and writing them to 8 files each took 48 seconds, not 16. This benefits our parallel method, since we partition the lattice first and then apply pipesort to each part. Therefore, the pipes generated in the parallel method are considerably shorter.

In order to verify that our results also hold for newer machines with faster processors, more memory per processor, and higher bandwidth, we ported our code to a SunFire 6800 and performed comparison tests on the same data sets. Figure 6 shows the running times observed for the SunFire 6800. The absolute running times observed are considerably faster, as expected. The SunFire is approximately 4 times faster than the PC cluster.


Figure 6. SunFire 6800 running time in seconds as a function of the number of processors. Same data set as in figure 5.

Most importantly, the shapes of the curves are essentially the same as for the PC cluster. The runtime (slowest proc.) and average time curves are very similar and are both very close to the theoretical optimum curve. That is, for an oversampling ratio of s = 2, an optimal speedup of p is also observed for the SunFire 6800. The larger SunFire installation also allowed us to test our code for a larger number of processors. As shown in figure 6, we still obtain optimal speedup p when using 16 processors on the same dataset.

Figure 7 shows the PC cluster running times of our top-down data cube parallelization as we increase the data size from 100,000 to 1,000,000 rows. The main observation is that the parallel runtime increases slightly more than linearly with respect to the data size, which is consistent with the fact that sorting requires time O(n log n). Figure 7 shows that our parallel top-down data cube construction method scales gracefully with respect to the data size.

Figure 8 shows the PC cluster running time as a function of the oversampling ratio s. We observe that, for our test case, the parallel runtime (i.e. the time taken by the slowest processor) is best for s = 3. This is due to the following tradeoff. Clearly, the workload balance improves as s increases. However, as the total number of subtrees, s × p, generated in the tree partitioning algorithm increases, we need to perform more sorts for the root nodes of these subtrees. The optimal tradeoff point for our test case is s = 3. It is important to note that the oversampling ratio s is a tunable parameter. The best value for s depends on a number of factors. What our experiments show is that s = 3 is sufficient for the load balancing. However, as the data set grows in size, the time for the sorts of the root nodes of the subtrees increases more than linearly, whereas the effect on the imbalance is linear. For substantially larger data sets, e.g. 1G rows, we expect the optimal value for s to be s = 2.

Figure 9 shows the PC cluster running time of our top-down data cube parallelization as we increase the dimension of the data set from 2 to 10. Note that the number of group-bys that must be computed grows exponentially with respect to the dimension of the data set. In figure 9, we observe that the parallel running time grows essentially linearly with respect to the output size.


Figure 7. PC cluster running time in seconds as a function of the data size. (Fixed parameters: Number of processors = 8. Dimensions = 7. Experiments per data point = 5.)

Figure 8. PC cluster running time in seconds as a function of the oversampling ratio s. (Fixed parameters: Data size = 1,000,000 rows. Number of processors = 8. Dimensions = 7. Experiments per data point = 5.)


Figure 9. PC cluster running time in seconds as a function of the number of dimensions. (Fixed parameters: Data size = 200,000 rows. Number of processors = 8. Experiments per data point = 5.) Note: Work grows exponentially with respect to the number of dimensions.

We also tried our code on very high dimensional data where the size of the output becomes extremely large. For example, we executed our parallel algorithm for a 15-dimensional data set of 10,000 rows, and the resulting data cube was of size more than 1 GB.

Figure 10 shows the PC cluster running time of our top-down data cube parallelization as we increase the cardinality in each dimension, that is, the number of different possible data values in each dimension.

Figure 10. PC cluster running time in seconds as a function of the cardinality, i.e. the number of different possible data values in each dimension. (Fixed parameters: Data size = 200,000 rows. Number of processors = 8. Dimensions = 7. Experiments per data point = 5.)


Figure 11. PC cluster running time in seconds as a function of the skew of the data values in each dimension, based on ZIPF. (Fixed parameters: Data size = 200,000 rows. Number of processors = 8. Dimensions = 7. Experiments per data point = 5.)

Recall that top-down pipesort [1] is aimed at dense data cubes. Our experiments were performed for 3 cardinality levels: 5, 10, and 100 possible values per dimension. The results shown in figure 10 confirm our expectation that the method performs better for denser data.

Figure 11 shows the PC cluster running time of our top-down data cube parallelization for data sets with skewed distribution. We used the standard ZIPF distribution in each dimension, with α = 0 (no skew) to α = 3. Since data reduction in top-down pipesort [1] increases with skew, the total time observed is expected to decrease with skew, which is exactly what we observe in figure 11. Our main concern regarding our parallelization method was how balanced the partitioning of the tree would be in the presence of skew. The main observation in figure 11 is that the relative difference between runtime (slowest processor) and average time does not increase as we increase the skew. This appears to indicate that our partitioning method is robust in the presence of skew.

7. Comparison with previous results

In this section we summarize previous results on parallel data cube computation and compare them to the results presented in this paper.

In [13, 14], the authors observe that a group-by computation is essentially a parallel prefix operation and reduce the data cube problem to a sequence of parallel prefix computations. No implementation of this method is mentioned and no experimental performance evaluation is presented. This method creates large communication overhead and will most likely show unsatisfactory speedup. The methods in [20, 22], as well as the methods presented in this paper, reduce communication overhead by partitioning the load and assigning sets of group-by computations to individual processors.


As discussed in [20, 22], balancing the load assigned to different processors is a hard problem. The approach in [20] uses a simple greedy heuristic to parallelize hash-based data cube computation. As observed in [20], this simple method is not scalable. Load balance and speedup are not satisfactory for more than 4 processors. A subsequent paper by the same group [31] focuses on the overlap between multiple data cube computations in the sequential setting. The approach in [22] considers the parallelization of sort-based data cube construction. It studies parallel bottom-up Iceberg-cube computation. Four different methods are presented: RP, RPP, ASL, and PT. The experimental results presented indicate that ASL and PT have the best performance among those four. The main reason is that RP and RPP show weak load balancing. PT is somewhat similar to our parallel bottom-up data cube construction method presented in Section 3, since PT also partitions the bottom-up tree. However, PT partitions the bottom-up tree simply into subtrees with equal numbers of nodes, and it requires considerably more tasks than processors to obtain good load balance. As observed in [22], when a larger number of tasks is required, then performance problems arise because such an approach reduces the possibility of sharing of prefixes and sort orders between different group-by computations. In contrast, our parallel bottom-up method in Section 3 assigns only two tasks to each processor. These tasks are coarse grained, which greatly improves sharing of prefixes and sort orders between different group-by computations. Therefore, we expect that our method will not have a decrease in performance for a larger number of processors as observed in [22]. The ASL method uses a parallel top-down approach, using a skiplist to maintain the cells in each group-by. ASL is parallelized by making the construction of each group-by a separate task, hoping that a large number of tasks will create a good overall load balancing. It uses a simple greedy approach for assigning tasks to processors that is similar to [20]. Again, as observed in [22], the large number of tasks brings with it performance problems because it reduces the possibility of sharing of prefixes and sort orders between different group-by computations. In contrast, our parallel top-down method in Section 4 creates only very few coarse tasks. More precisely, our algorithm assigns s tasks (subtrees) to each processor, where s is the oversampling ratio. As shown in Section 6, an oversampling ratio s ≤ 3 is sufficient to obtain good load balancing. In that sense, our method answers the open question in [22] on how to obtain good load balancing without creating so many tasks. This is also clearly reflected in the experimental performance of our methods in comparison to the experiments reported in [22]. As observed in [22], their experiments (figure 10 in [22]) indicate that ASL obtains essentially zero speedup when the number of processors is increased from 8 to 16. In contrast, our experiments (figure 6 of Section 6) show that our parallel top-down method from Section 4 still doubles its speed when the number of processors is increased from 8 to 16 and obtains optimal speedup p when using 16 processors.

8. Conclusion and future work

We presented two different, partitioning based, data cube parallelizations for standard shared disk type parallel machines. Our partitioning strategies for bottom-up and top-down data cube parallelization balance the loads assigned to the individual processors, where the loads are measured as defined by the original proponents of the respective sequential methods.


Subcube computations are carried out using existing sequential data cube algorithms. Our top-down partitioning strategy can also be easily extended to parallelize the ArrayCube method. Experimental results indicate that our partitioning methods are efficient in practice. Compared to existing parallel data cube methods, our parallelization approach brings a significant reduction in inter-processor communication and has the important practical benefit of enabling the re-use of existing sequential data cube code.

A possible extension of our data cube parallelization methods is to consider a shared nothing parallel machine model. If it is possible to store a duplicate of the input data set R on each processor's disk, then our method can be easily adapted for such an architecture. This is clearly not always possible. It does solve most of those cases where the total output size is considerably larger than the input data set; for example, sparse data cube computations. The data cube can be several hundred times as large as R. Sufficient total disk space is necessary to store the output (as one single copy distributed over the different disks), and a p times duplication of R may be smaller than the output. Our data cube parallelization method would then partition the problem in the same way as described in Sections 3 and 4, and subcube computations would be assigned to processors in the same way as well. When computing its subcube, each processor would read R from its local disk. For the output, there are two alternatives. Each processor could simply write the subcubes generated to its local disk. This could, however, create a bottleneck if there is, for example, a visualization application following the data cube construction which needs to read a single group-by. In such a case, each group-by should be distributed over all disks, for example in striped format. To obtain such a data distribution, all processors would not write their subcubes directly to their local disks but buffer their output. Whenever the buffers are full, they would be permuted over the network. In summary we observe that, while our approach is aimed at shared disk parallel machines, its applicability to shared nothing parallel machines depends mainly on the distribution and availability of the input data set R. An interesting open problem is to identify the "ideal" distribution of input R among the p processors when a fixed amount of replication of the input data is allowed (i.e., R can be copied r times, 1 ≤ r < p).

Another interesting question for future work is the relationship between top-down and bottom-up data cube computation in the parallel setting. These are two conceptually very different methods. The existing literature suggests that bottom-up methods are better suited for high dimensional data. So far, we have implemented our parallel top-down data cube method, which took about one person year of full time work. We chose to implement the top-down method because it has more tunable parameters to be discovered through experimentation. A possible future project could be to implement our parallel bottom-up data cube method in a similar environment (same compiler, message passing library, data structure libraries, disk access methods, etc.) and measure the various trade-off points between the two methods. As indicated in [22], the critical parameters for parallel bottom-up data cube computation are similar: good load balance and a small number of coarse tasks. This leads us to believe that our parallel bottom-up method should perform well. Compared to our parallel top-down method, our parallel bottom-up method has fewer parameters available for fine-tuning the code.


Therefore, the trade-off points in the parallel setting between top-down and bottom-up methods may be different from the sequential setting.

Relatively little work has been done on the more difficult problem of generating partial data cubes, that is, not the entire data cube but only a given subset of group-bys. Given a lattice and a set of selected group-bys that are to be generated, the challenge is in deciding which other group-bys should be computed in order to minimize the total cost of computing the partial data cube. In many cases computing intermediate group-bys that are not in the selected set, but from which several views in the selected set can be computed cheaply, will reduce the overall computation time. Sarawagi et al. [25] suggest an approach based on augmenting the lattice with additional vertices (to represent all possible orderings of each view's attributes) and additional edges (to represent all relationships between views). Then a Minimum Steiner Tree approximation algorithm is run to identify some number of "intermediate" nodes (so-called Steiner points) that can be added to the selected subset to "best" reduce the overall cost. An approximation algorithm is used because the optimal Minimum Steiner Tree problem is NP-complete. The intermediate nodes introduced by this method are, of course, to be drawn from the non-selected nodes in the original lattice. By adding these additional nodes, the cost of computing the selected nodes is reduced. Although theoretically neat, this approach is not effective in practice. The problem is that the augmented lattice has far too many vertices and edges to be processed efficiently. For example, in a 6 dimensional partial data cube the number of vertices and edges in the augmented lattice increase by factors of 30 and 8684 respectively. For an 8 dimensional partial data cube the number of vertices and edges increase by factors of 428 and 701,346 respectively. The augmented lattice for a 9 dimensional partial data cube has more than 2,000,000,000 edges. Another approach is clearly necessary. The authors are currently implementing new algorithms for generating partial data cubes. We consider this an important area of future research.

Acknowledgments

The authors would like to thank Steven Blimkie, Zimmin Chen, Khoi Manh Nguyen, Thomas Pehle, and Suganthan Sivagnanasundaram for their contributions towards the implementation described in Section 6. The first, second, and fourth author's research was partially supported by the Natural Sciences and Engineering Research Council of Canada. The third author's research was partially supported by the National Science Foundation under Grant 9988339-CCR.

References

1. S. Agarwal, R. Agarwal, P.M. Deshpande, A. Gupta, J.F. Naughton, R. Ramakrishnan, and S. Sarawagi, "On the computation of multi-dimensional aggregates," in Proc. 22nd VLDB Conf., 1996, pp. 506–521.
2. Argonne National Laboratory, http://www-unix.mcs.anl.gov/mpi/index.html. The Message Passing Interface (MPI) Standard, 2001.
3. R.I. Becker, Y. Perl, and S.R. Schach, "A shifting algorithm for min-max tree partitioning," J. ACM, vol. 29, pp. 58–67, 1982.
4. K. Beyer and R. Ramakrishnan, "Bottom-up computation of sparse and iceberg cubes," in Proc. of 1999 ACM SIGMOD Conference on Management of Data, 1999, pp. 359–370.
5. T. Cheatham, A. Fahmy, D.C. Stefanescu, and L.G. Valiant, "Bulk synchronous parallel computing—A paradigm for transportable software," in Proc. of the 28th Hawaii International Conference on System Sciences, vol. 2: Software Technology, 1995, pp. 268–275.
6. F. Dehne, W. Dittrich, and D. Hutchinson, "Efficient external memory algorithms by simulating coarse-grained parallel algorithms," in Proc. 9th ACM Symposium on Parallel Algorithms and Architectures (SPAA'97), 1997, pp. 106–115.
7. F. Dehne, W. Dittrich, D. Hutchinson, and A. Maheshwari, "Parallel virtual memory," in Proc. 10th Annual ACM-SIAM Symposium on Discrete Algorithms, 1999, pp. 889–890.
8. F. Dehne, A. Fabri, and A. Rau-Chaplin, "Scalable parallel computational geometry for coarse grained multicomputers," in ACM Symp. Computational Geometry, 1993, pp. 298–307.
9. F. Dehne, D. Hutchinson, and A. Maheshwari, "Reducing I/O complexity by simulating coarse grained parallel algorithms," in Proc. 13th International Parallel Processing Symposium (IPPS'99), 1999, pp. 14–20.
10. P.M. Deshpande, S. Agarwal, J.F. Naughton, and R. Ramakrishnan, "Computation of multidimensional aggregates," Technical Report 1314, University of Wisconsin, Madison, 1996.
11. P. Flajolet and G.N. Martin, "Probabilistic counting algorithms for database applications," Journal of Computer and System Sciences, vol. 31, no. 2, pp. 182–209, 1985.
12. G.N. Frederickson, "Optimal algorithms for tree partitioning," in Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), 1991, pp. 168–177.
13. S. Goil and A. Choudhary, "High performance OLAP and data mining on parallel computers," Journal of Data Mining and Knowledge Discovery, vol. 1, no. 4, 1997.
14. S. Goil and A. Choudhary, "A parallel scalable infrastructure for OLAP and data mining," in Proc. International Data Engineering and Applications Symposium (IDEAS'99), Montreal, Aug. 1999.
15. M. Goudreau, K. Lang, S. Rao, T. Suel, and T. Tsantilas, "Towards efficiency and portability: Programming with the BSP model," in Proc. 8th ACM Symposium on Parallel Algorithms and Architectures (SPAA'96), 1996, pp. 1–12.
16. J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh, "Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals," J. Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 29–53, 1997.
17. V. Harinarayan, A. Rajaraman, and J.D. Ullman, "Implementing data cubes efficiently," SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 25, no. 2, pp. 205–216, 1996.
18. J. Hill, B. McColl, D. Stefanescu, M. Goudreau, K. Lang, S. Rao, T. Suel, T. Tsantilas, and R. Bisseling, "BSPlib: The BSP programming library," Parallel Computing, vol. 24, no. 14, pp. 1947–1980, 1998.
19. V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing, The Benjamin/Cummings Publishing Company, Inc.: Menlo Park, CA, 1994.
20. H. Lu, X. Huang, and Z. Li, "Computing data cubes using massively parallel processors," in Proc. 7th Parallel Computing Workshop (PCW'97), Canberra, Australia, 1997.
21. Max Planck Institute, LEDA. http://www.mpi-sb.mpg.de/LEDA/.
22. R.T. Ng, A. Wagner, and Y. Yin, "Iceberg-cube computation with PC clusters," in Proc. of 2001 ACM SIGMOD Conference on Management of Data, May 2001, pp. 25–36.
23. Y. Perl and U. Vishkin, "Efficient implementation of a shifting algorithm," Disc. Appl. Math., vol. 12, pp. 71–80, 1985.
24. K.A. Ross and D. Srivastava, "Fast computation of sparse datacubes," in Proc. 23rd VLDB Conference, 1997, pp. 116–125.
25. S. Sarawagi, R. Agrawal, and A. Gupta, "On computing the data cube," Technical Report RJ10026, IBM Almaden Research Center, San Jose, CA, 1996.
26. A. Shukla, P. Deshpande, J.F. Naughton, and K. Ramasamy, "Storage estimation for multidimensional aggregates in the presence of hierarchies," in Proc. 22nd VLDB Conference, 1996, pp. 522–531.
27. J.F. Sibeyn and M. Kaufmann, "BSP-like external-memory computation," in Proc. of 3rd Italian Conf. on Algorithms and Complexity (CIAC-97), LNCS, vol. 1203, Springer: New York, 1997, pp. 229–240.
28. D.E. Vengroff and J.S. Vitter, "I/O-efficient scientific computation using TPIE," in Proc. Goddard Conference on Mass Storage Systems and Technologies, 1996, pp. 553–570.
29. J.S. Vitter, "External memory algorithms," in Proc. 17th ACM Symp. on Principles of Database Systems (PODS'98), 1998, pp. 119–128.


30. J.S. Vitter and E.A.M. Shriver, "Algorithms for parallel memory, I: Two-level memories," Algorithmica, vol. 12, no. 2/3, pp. 110–147, 1994.
31. J.X. Yu and H. Lu, "Multi-cube computation," in Proc. 7th International Symposium on Database Systems for Advanced Applications, Hong Kong, April 18–21, 2001.
32. Y. Zhao, P.M. Deshpande, and J.F. Naughton, "An array-based algorithm for simultaneous multidimensional aggregates," in Proc. ACM SIGMOD Conf., 1997, pp. 159–170.


Distributed and Parallel Databases, 11, 203–229, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Boosting Algorithms for Parallel and Distributed Learning

ALEKSANDAR LAZAREVIC [email protected]
ZORAN OBRADOVIC [email protected]
Center for Information Science and Technology, Temple University, 303 Wachman Hall (038-24), 1805 N. Broad St., Philadelphia, PA 19122-6094, USA

Recommended by: Mohammed J. Zaki

Abstract. The growing amount of available information and its distributed and heterogeneous nature has a major impact on the field of data mining. In this paper, we propose a framework for parallel and distributed boosting algorithms intended for efficiently integrating specialized classifiers learned over very large, distributed and possibly heterogeneous databases that cannot fit into main computer memory. Boosting is a popular technique for constructing highly accurate classifier ensembles, where the classifiers are trained serially, with the weights on the training instances adaptively set according to the performance of previous classifiers. Our parallel boosting algorithm is designed for tightly coupled shared memory systems with a small number of processors, with an objective of achieving the maximal prediction accuracy in fewer iterations than boosting on a single processor. After all processors learn classifiers in parallel at each boosting round, they are combined according to the confidence of their prediction. Our distributed boosting algorithm is proposed primarily for learning from several disjoint data sites when the data cannot be merged together, although it can also be used for parallel learning where a massive data set is partitioned into several disjoint subsets for a more efficient analysis. At each boosting round, the proposed method combines classifiers from all sites and creates a classifier ensemble on each site. The final classifier is constructed as an ensemble of all classifier ensembles built on disjoint data sets. The new proposed methods applied to several data sets have shown that parallel boosting can achieve the same or even better prediction accuracy considerably faster than the standard sequential boosting. Results from the experiments also indicate that distributed boosting has comparable or slightly improved classification accuracy over standard boosting, while requiring much less memory and computational time since it uses smaller data sets.

Keywords: parallel boosting, distributed boosting, heterogeneous databases, boosting specialized experts

1. Introduction

The recent, explosive growth of information available to business and scientific fields has resulted in an unprecedented opportunity to develop automated data mining techniques for extracting useful knowledge from massive data sets. Large-scale data analysis problems very often also involve the investigation of relationships among attributes in heterogeneous data sets where rules identified among the observed attributes in certain regions do not apply elsewhere. This problem may be further complicated by the fact that in many cases, the heterogeneous databases are located at multiple distributed sites. Therefore, the issues of modern data mining include not just the size of the data to be mined but also its location and homogeneity.


Data may be distributed across a set of sites or computers for several reasons. For example, several data sets concerning business information (e.g. telephone or credit card fraud) might be owned by separate organizations that have competitive reasons for keeping the data private. In addition, these data may be physically dispersed over many different geographic locations. However, business organizations may be interested in enhancing their own models by exchanging useful information about the data. Another need for learning from multiple sources could be found when datasets have grown too large to fit into the main computer memory. A parallel approach to building a model in such a situation is aimed at solving the practical problem of how to learn from large data sets. For this purpose, various parallelized machine learning algorithms were proposed, e.g. parallel decision trees [26, 28, 29], parallel association rules [31] and parallel rule induction [23]. On the other hand, in order to solve the problem of learning from very large and/or distributed databases, some researchers have proposed incremental learning techniques. Usually these techniques involve direct modifications of standard learning algorithms, such as decision trees [30] and rule learners [5].

Since distributed learning usually involves several data sets from multiple sites, an alternative and fairly general method for distributed learning is to combine different multiple predictors in a "black-box" manner. Different meta-learning techniques explored in the JAM project [4, 22] were proposed in order to coalesce the predictions of classifiers trained from different partitions of the training set. The advantage of this approach is that it is algorithm-independent, it can be used to scale up many learning algorithms, and it ensures the privacy of data at multiple sites.

In this paper, we propose a novel technique of combining classifiers from multiple sites using a boosting technique [9]. Boosting uses adaptive sampling of patterns to generate a highly accurate ensemble of many weak classifiers whose individual global accuracy is only moderate. In boosting, the classifiers in the ensemble are trained serially, with the weights on the training instances adjusted adaptively according to the performance of the previous classifiers. The main idea is that the classification algorithm should concentrate on the instances that are difficult to learn. Boosting has received extensive theoretical and empirical study [10, 19], but most of the published work focuses on improving the accuracy of a weak classifier over the same single, centralized data set that is small enough to fit into the main memory. So far, there has not been much research on using the boosting technique for distributed and parallel learning. The only exception was boosting for scalable and distributed learning [7], where each classifier was trained using only a small fraction of the training set. In this distributed version, the classifiers were trained either from random samples (r-sampling) or from disjoint partitions of the data set (d-sampling). In r-sampling, a fixed number of examples were randomly picked from the weighted training set (without replacement), where all examples had an equal chance of being selected. In d-sampling, the weighted training set was partitioned into a number of disjoint subsets, where the data from each site was taken as a d-sample. At each round, a different d-sample was given to the weak learner. Both methods can be used for learning over very large data sets, but d-sampling is more appropriate for distributed learning, where data at multiple sites cannot be pulled together to a single site. The reported experimental results indicated that their distributed boosting is either comparable to or better than learning single classifiers over the complete training set, but only in some cases comparable to boosting over the complete data set.


Our objective was to develop a boosting technique applicable to both parallel and distributed environments. The first boosting modification represents a "parallelized" version of the boosting algorithm for a tightly coupled shared memory system with a small number of processors (e.g. a dual processor system). This method is applicable when all the data can fit into the main memory, with the major goal of speeding up the boosting process. At each boosting round, all classifiers are trained on different samples drawn from the same training data set and then assigned to one of the data instances according to the confidence of their prediction. Empirical studies of several data sets have demonstrated that this method can achieve the same or slightly better classification accuracy considerably faster than the standard boosting algorithm.

The second boosting adaptation explored in our study is more suitable for distributed learning, since it assumes that the disjoint data sets from multiple sites cannot be merged together. However, it can also be applied to parallel learning, where the training data set is split into several sets that reside on different processors within a parallel computer. The data sets can be either homogeneous or heterogeneous, or can even contain different data distributions. In the proposed method, at each boosting round the classifiers are first learned on disjoint datasets and then exchanged amongst the sites. The exchanged classifiers are then combined and their weighted voting ensemble is constructed on each disjoint data set. The final ensemble represents an ensemble of ensembles built on all local distributed sites. The performance of the ensembles is used to update the probabilities of drawing data samples in succeeding boosting iterations. Our experimental results indicate that this method is computationally effective and comparable to or even slightly better in achieved accuracy than when boosting is applied to the centralized data.

2. Methodology

The modifications of the boosting algorithm that we propose in this paper are variants of the AdaBoost.M2 procedure [9], shown in figure 1.

Figure 1. The AdaBoost.M2 algorithm.


The algorithm supports multi-class problems and proceeds in a series of T rounds. In every round, a weak learning algorithm is called and presented with a different distribution Dt that is altered by emphasizing particular training examples. The distribution is updated to give wrong classifications higher weights than correct classifications. The entire weighted training set is given to the weak learner to compute the weak hypothesis ht. At the end, all weak hypotheses are combined into a single hypothesis hfn.
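The following sketch of the AdaBoost.M2 bookkeeping is an illustration based on [9], not the implementation used in the experiments; the trainWeak callback is a hypothetical stand-in for the weak learner, and h(i, y) returns the plausibility of label y for training example i.

```cpp
// Minimal sketch of the AdaBoost.M2 bookkeeping (based on [9]; illustrative
// only).  The trainWeak callback is a hypothetical stand-in for the weak
// learner; h(i, y) returns the "plausibility" of label y for training
// example i, a value in [0, 1].
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Hypothesis = std::function<double(int /*example*/, int /*label*/)>;

struct BoostedModel {
    std::vector<Hypothesis> hyps;
    std::vector<double> logInvBeta;                    // log(1 / beta_t)
    int predict(int example, int numClasses) const {   // h_fn: weighted vote
        int best = 0; double bestScore = -1.0;
        for (int y = 0; y < numClasses; ++y) {
            double s = 0.0;
            for (std::size_t t = 0; t < hyps.size(); ++t)
                s += logInvBeta[t] * hyps[t](example, y);
            if (s > bestScore) { bestScore = s; best = y; }
        }
        return best;
    }
};

BoostedModel adaBoostM2(const std::vector<int>& labels, int numClasses, int T,
        const std::function<Hypothesis(const std::vector<double>&)>& trainWeak) {
    const int m = static_cast<int>(labels.size());
    // Mislabel weights w[i][y], defined for y != labels[i] (kept at 0 otherwise).
    std::vector<std::vector<double>> w(
        m, std::vector<double>(numClasses, 1.0 / (m * (numClasses - 1))));
    for (int i = 0; i < m; ++i) w[i][labels[i]] = 0.0;

    BoostedModel model;
    for (int t = 0; t < T; ++t) {
        std::vector<double> W(m, 0.0), D(m, 0.0);
        double sumW = 0.0;
        for (int i = 0; i < m; ++i) {
            for (int y = 0; y < numClasses; ++y) W[i] += w[i][y];
            sumW += W[i];
        }
        for (int i = 0; i < m; ++i) D[i] = W[i] / sumW;    // distribution D_t

        Hypothesis h = trainWeak(D);                       // weak hypothesis h_t

        double eps = 0.0;                                  // pseudo-loss of h_t
        for (int i = 0; i < m; ++i) {
            double wrong = 0.0;
            for (int y = 0; y < numClasses; ++y)
                if (y != labels[i]) wrong += (w[i][y] / W[i]) * h(i, y);
            eps += 0.5 * D[i] * (1.0 - h(i, labels[i]) + wrong);
        }
        const double beta = std::max(eps, 1e-12) / (1.0 - eps);

        for (int i = 0; i < m; ++i)                        // emphasize hard pairs
            for (int y = 0; y < numClasses; ++y)
                if (y != labels[i])
                    w[i][y] *= std::pow(beta, 0.5 * (1.0 + h(i, labels[i]) - h(i, y)));

        model.hyps.push_back(h);
        model.logInvBeta.push_back(std::log(1.0 / beta));
    }
    return model;
}
```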

The boosting algorithm may be appropriate for distributed learning for several reasons: it can be applied to a wide variety of algorithms, it is often superior to other combining methods, and its weighted voting ensemble can easily scale the magnitudes of classifiers, giving a large weight to a strong hypothesis and thus correcting wrong classifications of many weaker hypotheses. In addition, a natural way of learning in a distributed environment is by combining classification predictors. Our objective, therefore, is to exploit all of these advantages in order to apply boosting to parallel and distributed learning.

As classifiers in all boosting experiments, we trained two-layer feedforward neural network (NN) models, since such universal approximators were often reported to outperform the alternatives for classification of real life non-linear phenomena [12]. The NN classification models had the number of hidden neurons equal to the number of input attributes. To use NNs with the AdaBoost.M2 algorithm, our implementation had the number of output nodes equal to the number of classes, where the predicted class is taken from the output with the largest response. In such an implementation, the output nodes compose a set of "plausible" labels, thus directly satisfying the requirement of the AdaBoost.M2 algorithm that the values of the output nodes indicate a "degree of plausibility"; all these plausibilities do not necessarily add up to 1. To optimize NN parameters we used the resilient propagation [24] and Levenberg-Marquardt [11] learning algorithms.

Although there are known ways of combining NNs trained on different subsets in order to produce a single learner (e.g., Breiman's born again trees [3]), very often they do not provide as good accuracy as an ensemble of classifiers created using the boosting algorithm. Since our major objective was to improve the generalization capabilities of the proposed methods, constructing a simpler and more comprehensible model was not considered in this study.

2.1. Boosting for parallel learning

The idea of the proposed parallel boosting is to speed up the learning process of standard boosting. Given a tightly coupled shared-memory system with a few processors, our goal is to train classifiers on each of the available processors and achieve the maximal prediction accuracy faster than when learning on a single processor. We assume there are k processors in the system, and each of them has access to the entire training data set. The proposed algorithm is shown in figure 2.

In the proposed method, the classifiers are constructed on each of the k available processors at each boosting round. Each classifier is trained on a different sample Qj,t drawn from the same training set S according to the same distribution Dt. Instead of a single classifier built at each boosting iteration, in parallel boosting there are k classifiers that compete for the data examples according to the confidence of their predictions. The classifier with the highest prediction confidence for some data instance is responsible for making the prediction on that data example.


Figure 2. Parallel boosting algorithm.

Therefore, there is a different classifier responsible for each data point inside the data set, while the hypothesis ht for boosting iteration t is a mixture of weak hypotheses hj,t, j = 1, . . . , k. The distribution Dt is updated according to the performance of the mixed hypothesis ht on the training set S, and it is used by each processor to draw samples in subsequent boosting rounds. The composite hypothesis ht is also used when making the final hypothesis hfn. It is very important to note that within the system only the classifiers are moved, not the data examples themselves.
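A minimal sketch of this competition step, assuming each of the k per-round classifiers returns a matrix of per-class plausibilities, might look as follows (the function name is illustrative):

    import numpy as np

    def competing_mixture(classifiers, X):
        # Mixture hypothesis h_t of one parallel-boosting round: every example is
        # predicted by the classifier that is most confident about it.
        outputs = np.stack([clf(X) for clf in classifiers])  # (k, n_examples, n_classes)
        confidence = outputs.max(axis=2)                     # each classifier's confidence
        winner = confidence.argmax(axis=0)                   # most confident classifier per example
        return outputs[winner, np.arange(X.shape[0]), :]     # mixed plausibilities

The mixed plausibilities then play the role of a single weak hypothesis in the distribution update and in the final weighted vote.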

2.2. Boosting for distributed learning in homogeneous databases

2.2.1. The general framework of distributed boosting. The objective of a distributed learning algorithm is to efficiently construct a prediction model using data at multiple sites such that prediction accuracy is similar to learning when all the data are centralized at a single site. Towards such an objective, we propose several modifications of the boosting algorithm within the general framework presented in figure 3. All distributed sites perform the learning procedure at the same time.

Assume there are k distributed sites, where site j contains data set Sj with mj examples, j = 1, . . . , k. Data sets Sj contain the same attributes and do not necessarily have the same size. During the boosting rounds, site j maintains a local distribution Δj,t and the local weights wj,t that directly reflect the prediction accuracy on that site. However, our goal is to emulate the global distribution Dt obtained through iterations when standard boosting is applied to a single data set obtained by merging all sets from the distributed sites.


Figure 3. The distributed boosting framework.

In order to create such a distribution that will result in similar sampling as when all the data are centralized, the weight vectors wj,t, j = 1, . . . , k, from all distributed sites are merged into a joint weight vector wt, such that the q-th interval of indices [m1 + · · · + mq−1 + 1, m1 + · · · + mq] in the weight vector wt corresponds to the weight vector wq,t from the q-th site. The weight vector wt is used to update the global distribution Dt (step 5, figure 1). However, merging all the weight vectors wj,t requires a huge amount of time for broadcasting, since they directly depend on the size of the distributed data sets. In order to reduce this transfer time, instead of the entire weight vectors wj,t, only the sums Vj,t of all their elements are broadcast (step 9, figure 3). Since data site j samples only from set Sj, there is no need to know the exact values of the elements in the weight vectors wq,t (q ≠ j, q = 1, . . . , k) from other distributed sites. Instead, it is sufficient to know only the number of data examples that need to be sampled from site q.


Therefore, each site j creates a weight vector Uj,t (step 10, figure 3), where its j-th interval [m1 + · · · + mj−1 + 1, m1 + · · · + mj] represents the weight vector wj,t, while all other intervals that correspond to the weight vectors from other distributed sites may be set arbitrarily, as long as the values inside the q-th interval of indices (q ≠ j) sum to the value Vq,t. The simplest method to do this is to set all values in the q-th interval to the value Vq,t/mq. Using this method, expensive broadcasting of the huge weight vectors is avoided, while still preserving the information about which site is more difficult to learn and where more examples need to be sampled.
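Under the assumption that each site already holds its own weight vector and the broadcast sums of the others, the construction of Uj,t and of the local version of the global distribution could be sketched as follows (helper names are ours):

    import numpy as np

    def build_surrogate_weights(j, local_w, sums, sizes):
        # Site j's stand-in for the global weight vector: its own interval holds w_{j,t};
        # every other site q's interval is filled with the broadcast sum V_{q,t}
        # spread uniformly over that site's m_q slots.
        parts = []
        for q, (V, m) in enumerate(zip(sums, sizes)):
            if q == j:
                parts.append(np.asarray(local_w, dtype=float))
            else:
                parts.append(np.full(m, V / m))
        U = np.concatenate(parts)
        D = U / U.sum()          # site j's version D_{j,t} of the global distribution
        return U, D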

As a result, each site at round t maintains its version Dj,t of the global distribution Dt and its local distribution Δj,t. At each site j, the samples in boosting rounds are drawn according to the distribution Dj,t, but the sampled training set Qj,t for site j is created only from those data points that match the indices drawn from the j-th interval in the distribution Dj,t. It is evident that the samples Qj,t from distributed sites do not necessarily have the same size through iterations, but the total number of examples drawn from all distributed sites is always the same. The motive for using “unbalanced” sampling from distributed sites is to simulate drawing the same instances as in the standard boosting. At each boosting round t, the classifiers Lj,t are constructed on each of the samples Qj,t and then exchanged among the distributed sites. However, if the sample Qj,t is not sufficiently large, the most accurate classifier Lj,p (p < t) constructed so far is used. The minimum size of the sample that may be used for training of classifiers is the size of a random data sample for which only a small, predefined accuracy loss is observed when compared to the accuracy obtained by learning from the entire training set. Since all sites contain a set of classifiers Lj,t, j = 1, . . . , k, the next steps involve creating an ensemble Ej,t by combining these classifiers and computing a composite hypothesis hj,t. The local weight vectors wj,t are updated at each site j in order to give wrong classifications higher weights than correct classifications (step 8, figure 3), and then their sums Vj,t are broadcast to all distributed sites. Each site j updates its local version Dj,t according to the created weight vector Uj,t. At the end, the composite hypotheses hj,t from different sites and different boosting iterations are combined into a final hypothesis hfn.

In order to simulate boosting on the centralized data, our intention was to draw more data instances from the sites that are more difficult for learning. The weights wj,t computed in step 8 directly reflect the prediction capability for each data point, thus satisfying our goal to sample more examples from the sites that are more difficult to learn. In order to further emphasize sampling from the sites that are difficult for learning, we consider dividing the weights wj,t by the factor accj^p (p = 0, 1 or 2), such that the difference between the weights from two sites is further increased. Here, accj corresponds to the local accuracy on the corresponding site j, and the factor p indicates how much we want to increase the difference between the weights from different sites (a larger value of p results in a larger difference between the weights).
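A one-line sketch of this re-weighting, with accj and p as described above:

    import numpy as np

    def emphasize_hard_site(w_local, acc_j, p=1):
        # Dividing a site's weights by acc_j**p enlarges the sampling weights of sites
        # that are harder to learn (lower local accuracy); p = 0 leaves them unchanged.
        return np.asarray(w_local, dtype=float) / (acc_j ** p)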

2.2.2. The variants of distributed boosting. We explore several variants of the proposed distributed boosting algorithm (figure 3). The variants differ in creating the ensemble Ej,t obtained by combining the classifiers Lj,t (step 4).

The simplest method for combining classifiers into an ensemble Ej,t is based on Simple Majority Voting of Classifiers from All Sites. If the classifiers Ll,t, l = 1, . . . , k, from all sites produce hypotheses hl,j,t on site j, then the hypothesis hj,t (step 5, figure 3) is computed as:

    hj,t = (1/k) · Σ(l=1..k) hl,j,t    (1)

More sophisticated techniques for distributed boosting consider weighted combinations of classifiers. In Weighted Majority Voting of Classifiers from All Sites, the weights ul,j,t of the classifiers Ll,t from all sites are proportional to the accuracy they achieve on the local site j. Therefore, if the classifiers Ll,t produce hypotheses hl,j,t on site j, then the hypothesis hj,t can be computed as:

    hj,t = [ Σ(l=1..k) ul,j,t · hl,j,t ] / [ Σ(l=1..k) ul,j,t ]    (2)
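The two combination rules might be sketched as follows, assuming the exchanged hypotheses are available as arrays of per-class plausibilities and the local accuracies ul,j,t are known:

    import numpy as np

    def simple_majority(hyps):
        # Equation (1): plain average of the k exchanged hypotheses h_{l,j,t}.
        return sum(hyps) / len(hyps)

    def weighted_majority(hyps, local_acc):
        # Equation (2): hypotheses weighted by the accuracy u_{l,j,t} that each
        # exchanged classifier achieves on the local site j.
        u = np.asarray(local_acc, dtype=float)
        stacked = np.stack(hyps)                        # (k, n_examples, n_classes)
        return (u[:, None, None] * stacked).sum(axis=0) / u.sum()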

We also consider Confidence-based Weighting of Classifiers from All Sites, where the classifiers from all sites are combined using a procedure similar to the boosting technique. If the classifiers Ll,t at iteration t produce hypotheses hl,j,t on the data set Sj from site j that maintains the distribution Δj,t, then this technique of combining classifiers is defined in figure 4.

Figure 4. The confidence-based technique for weighted combining classifiers from distributed sites.

2.3. Boosting for distributed learning in heterogeneous databases

We consider two scenarios when learning from heterogeneous databases among the distributed sites: (1) all heterogeneous databases with a similar mixture of distributions; (2) databases with different but homogeneous distributions. In both scenarios, all data sites have the same set of attributes.

2.3.1. Learning from heterogeneous databases with a similar mixture of distributions. Our previous research shows that in heterogeneous databases where several more homogeneous regions exist, standard boosting does not enhance the prediction capabilities as significantly as for homogeneous databases [14]. In such cases it is more useful to have several local experts, each with expertise in a small homogeneous region of the data set [15]. A possible approach to this problem is to cluster the data first and then to assign a single specialized classifier to each discovered cluster.


Figure 5. The distributed boosting specialized experts for sites with heterogeneous distributions.

Therefore, we combine this boosting specialized experts approach with the already proposed distributed boosting in order to further improve it. The general idea of boosting specialized experts in a distributed environment is shown in figure 5.

Similar to learning from homogeneous distributed databases (Section 2.2), all k sites again maintain their own versions Dj,t of the distribution Dt, and the final hypothesis Hfn represents the combination of hypotheses Hj,t from different sites and different boosting iterations.


Figure 6. The algorithm for merging specialized hypotheses hj,t,l into a composite hypothesis Hj,t .

However, in this scenario Hj,t is not the composite hypothesis that corresponds to a classifier ensemble created on the site j at iteration t; rather, it represents a composite hypothesis obtained by combining classifier ensembles Ej,t,l constructed on the cj clusters identified on site j at iteration t. Due to the similar mixture of distributions, we assume that the number of discovered clusters is the same on all sites. Ensembles Ej,t,l are constructed by combining classifiers Lj,t,l that are learned on clusters Qj,t,l, l = 1, . . . , cj, obtained by applying a clustering algorithm to the sample Qj,t. All classifiers Lj,t,l are then exchanged among all sites, but only the learners Lj,t,q, q = 1, . . . , cj, that correspond to the q-th cluster Qj,t,q are combined to create an ensemble Ej,t,q on each site j and to compute a corresponding hypothesis hj,t,q. Merging the hypotheses hj,t,l that correspond to the ensembles Ej,t,l into a composite hypothesis Hj,t is performed using the algorithm described in figure 6. Therefore, the final classifier corresponding to the final hypothesis Hfn is computed by combining the classifiers from the clusters discovered at different sites and different iterations.

In merging the hypotheses hj,t,l, data points from different clusters Sj,t,l have different pseudo-loss values εj,t,l and different parameter values βj,t,l. For each cluster Sj,t,l, l = 1, . . . , cj, from iteration t, defined by the convex hull CHj,t,l, there is a pseudo-loss εj,t,l and the corresponding parameter βj,t,l (figure 6). Both the pseudo-loss value εj,t,l and the parameter βj,t,l are computed independently for each cluster Sj,t,l for which a particular classifier Lj,t,l is responsible. Before updating the distribution Dj,t, a single vector βj,t is created such that the i-th position in the vector βj,t is equal to βj,t,l if the i-th pattern from the entire set Sj belongs to the cluster Sj,t,l identified at iteration t. Similarly, the hypotheses hj,t,l are merged into a single hypothesis Hj,t. Since we merge βj,t,l into βj,t and hj,t,l into Hj,t, updating the distribution Dj,t can be performed in the same way as in the distributed boosting algorithm (Section 2.2).
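Assuming the cluster membership of every local example is known, together with the per-cluster β values and hypotheses (each already evaluated on the whole local set), this merging step could be sketched as:

    import numpy as np

    def merge_cluster_hypotheses(cluster_of, betas, hyps):
        # Build the per-example vector beta_{j,t} and the composite hypothesis H_{j,t}:
        # example i inherits the beta value and the hypothesis of its cluster S_{j,t,l}.
        cluster_of = np.asarray(cluster_of)
        beta_vec = np.asarray(betas)[cluster_of]
        stacked = np.stack(hyps)                     # (n_clusters, n_examples, n_classes)
        H = stacked[cluster_of, np.arange(len(cluster_of)), :]
        return beta_vec, H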

Our distributed boosting algorithm for heterogeneous databases involves clustering at step 3 (figure 5). Therefore, there is a need to find a small subset of attributes that uncovers “natural” groupings (clusters) in the data according to some criterion. For this purpose we adopt the wrapper framework in unsupervised learning [6], where we apply the clustering algorithm to attribute subsets in the search space and then evaluate each subset by a criterion function that utilizes the clustering result. If there are d attributes, an exhaustive search of the 2^d possible attribute subsets to find the one that maximizes our selection criterion is computationally intractable.


Therefore, in our experiments, a fast sequential forward selection search is applied. As in [6], we adopt the scatter separability trace as the attribute selection criterion.
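A sketch of this wrapper-style search, using k-means as the clustering algorithm and the trace criterion trace(Sw^-1 Sb) as the separability score, is given below; the helper names are illustrative and the stopping rule (a fixed cap on the number of attributes) is a simplification of the procedure described above.

    import numpy as np
    from sklearn.cluster import KMeans

    def separability_trace(X, labels):
        # Scatter separability criterion trace(Sw^-1 Sb) for a clustering of X.
        overall_mean = X.mean(axis=0)
        d = X.shape[1]
        Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
        for c in np.unique(labels):
            Xc = X[labels == c]
            diff = Xc - Xc.mean(axis=0)
            Sw += diff.T @ diff
            m = (Xc.mean(axis=0) - overall_mean)[:, None]
            Sb += len(Xc) * (m @ m.T)
        return np.trace(np.linalg.pinv(Sw) @ Sb)

    def forward_select(X, n_clusters, max_attrs):
        # Greedy sequential forward selection of attributes for clustering.
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < max_attrs:
            scores = {}
            for a in remaining:
                cols = selected + [a]
                labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[:, cols])
                scores[a] = separability_trace(X[:, cols], labels)
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected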

This procedure, performed at step 2 of every boosting iteration, results in r relevant attributes for clustering. Thus, for each round of the boosting algorithm at each site j, there are relevant attribute subsets that are responsible for distinguishing among homogeneous distributions in the sample Qj,t. In order to find those homogeneous regions, clustering is performed at each boosting iteration. Two clustering algorithms are employed: the standard k-means algorithm and a density-based clustering algorithm, called DBSCAN [25], designed to discover clusters of arbitrary shape efficiently.

As a result of clustering, on each site j several distributions Dj,t,l (l = 1, . . . , cj) are obtained, where cj is the number of clusters discovered at site j. For each of the cj clusters discovered in the data sample Qj,t, a weak learner Lj,t,l is trained using the corresponding data distribution Sj,t,l, and a weak hypothesis hj,t,l is computed. Furthermore, for every cluster Qj,t,l identified in the sample Qj,t, its convex hull CHj,t,l is identified in the attribute space used for clustering, and these convex hulls are applied to the entire training set in order to find the corresponding clusters Sj,t,l (figure 7) [17]. All data points inside the convex hull CHj,t,l belong to the l-th cluster Sj,t,l discovered at iteration t on site j. Data points outside the identified convex hulls are attached to the cluster containing the closest data pattern. Therefore, instead of a single global classifier constructed in every boosting iteration, there are cj classifiers Lj,t,l and each of them is applied to the corresponding cluster Sj,t,l.
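The convex-hull mapping could be sketched with a Delaunay-based membership test (scipy); points outside every hull fall back to the cluster of their nearest sampled point, as described above. This is an illustrative sketch, not the authors' implementation.

    import numpy as np
    from scipy.spatial import Delaunay

    def map_clusters_by_hull(sample_X, sample_labels, full_X):
        # Assign every training point to the cluster whose convex hull (built from the
        # clustered sample) contains it; remaining points get the cluster of the
        # closest sampled point.
        sample_labels = np.asarray(sample_labels)
        labels = np.full(len(full_X), -1)
        for c in np.unique(sample_labels):
            hull = Delaunay(sample_X[sample_labels == c])
            inside = hull.find_simplex(full_X) >= 0
            labels[(labels == -1) & inside] = c
        outside = labels == -1
        if outside.any():
            d2 = ((full_X[outside, None, :] - sample_X[None, :, :]) ** 2).sum(axis=2)
            labels[outside] = sample_labels[d2.argmin(axis=1)]
        return labels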

When performing clustering during boosting iterations, it is possible that some of the discovered clusters are too small for training a specialized classifier. Hence, instead of training a specialized classifier on such a cluster with an insufficient amount of data, classifiers from previous iterations that were constructed on the corresponding clusters detected through the convex hull matching are consulted (figure 7), and the one with the maximal local prediction accuracy is employed.

Figure 7. Mapping convex hulls (CH1,1,l) of clusters Q1,1,l discovered in data sample Q1,1 to the entire training set S1 in order to find corresponding clusters S1,1,l. For example, all data points inside the contours of convex hull CH1,1,1 (corresponding to cluster Q1,1,1 discovered on Q1,1) belong to cluster S1,1,1 identified on S1.


2.3.2. Learning from different homogeneous distributions. An alternative boosting algorithm for learning in a heterogeneous distributed environment is proposed, where k distributed sites contain databases with the same attributes but with different homogeneous distributions. An unseen test data set may belong to one of these distributions or may be a mixture of distributions from the multiple sites. Here, the method for distributed boosting from Section 2.2 will not provide satisfactory prediction accuracy, since the classifiers learned from one data distribution will perform poorly on a data set with a different distribution. Therefore, when making predictions there is a need to identify appropriate classifiers and to determine a measure of this appropriateness.

When the test data set contains only one distribution, only classifiers that are constructed on data sets that stem from distributions similar to the distribution of the test set are combined. For determining similar distributions, the difference between them is computed using the Mahalanobis distance [8], since our previous research indicated that it could be an effective technique for distinguishing between two mixtures of distributions [21]. Given data sets S1 and S2 from distributed sites 1 and 2, the Mahalanobis distance between them is computed as:

    dmah = sqrt( (μS1 − μS2) · Σ^(−1) · (μS1 − μS2)^T ),    (3)

where μS1 and μS2 are the mean vectors of the data sets S1 and S2 respectively, and Σ is the sample covariance matrix [8]:

    Σ = [ (m1 − 1) · Σ1 + (m2 − 1) · Σ2 ] / (m1 + m2 − 2),    (4)

with Σ1 and Σ2 denoting the covariance matrices of S1 and S2. The Mahalanobis distance is computed without violating data privacy, since only the numbers of points (mj), mean vectors (μSj) and covariance matrices (Σj) are exchanged among the sites. Therefore, the distributed boosting algorithm (figure 3) is applied only to those sites that have distributions most similar to the distribution of the test data set.
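Since only the per-site summaries are exchanged, equations (3) and (4) could be evaluated as in the following sketch (function names are ours):

    import numpy as np

    def site_summary(X):
        # Only these statistics are exchanged between sites: size, mean, covariance.
        return len(X), X.mean(axis=0), np.cov(X, rowvar=False)

    def mahalanobis_between_sites(summary1, summary2):
        # Equations (3)-(4): distance between two data sets from their exchanged summaries.
        m1, mu1, S1 = summary1
        m2, mu2, S2 = summary2
        pooled = ((m1 - 1) * S1 + (m2 - 1) * S2) / (m1 + m2 - 2)    # equation (4)
        diff = mu1 - mu2
        return float(np.sqrt(diff @ np.linalg.inv(pooled) @ diff))  # equation (3)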

However, when the test data set contains a mixture of distributions from multiple sites, for each test point it is necessary to determine the originating distribution. For this purpose, given k data sets Sj, j = 1, . . . , k, the Mahalanobis distance is computed between a new instance r and the distribution corresponding to each of the sets Sj:

    dmah = sqrt( (r − μSj) · ΣSj^(−1) · (r − μSj)^T ),    (5)

where μSj and ΣSj are the mean vector and the covariance matrix of the data set Sj. The test data instances are classified into groups, such that all points inside one group are closest to one of the distributions from the k distributed sites. An ensemble of classifiers on each of the distributed sites is constructed independently using the standard boosting approach. The classifier ensemble Ej is applied to the test subset whose instances are closest to the distribution of the data set Sj. In addition, when two or more distributed sites have distributions sufficiently similar to one of the groups from the test data set, the distributed boosting algorithm from Section 2.2 is used to learn from the sites with similar distributions.
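Equation (5) and the grouping of test instances might be sketched as follows, reusing the per-site summaries from the previous sketch:

    import numpy as np

    def assign_test_points(X_test, site_summaries):
        # Equation (5): route each test instance to the site whose distribution
        # (mean vector and covariance matrix) is closest in Mahalanobis distance.
        dists = []
        for _, mu, S in site_summaries:                 # summaries as returned by site_summary()
            diff = X_test - mu
            inv = np.linalg.inv(S)
            dists.append(np.sqrt(np.einsum('ij,jk,ik->i', diff, inv, diff)))
        return np.argmin(np.stack(dists), axis=0)       # index of the closest site per test point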


3. Experimental results

Our experiments were performed on several data sets. When experimenting with parallel and distributed boosting in homogeneous environments, five data collections were used. The first one contained two synthetic spatial data sets with 6,561 instances generated using our spatial data simulator [20], such that the distributions of the generated data were approximately Gaussian but the samples exhibit spatial correlation similar to the distributions of real-life spatial data. One data set was used for training and the other one for out-of-sample testing. Since random splitting for spatial domains is likely to result in overly optimistic estimates of prediction error (due to spatial correlation in data [18]), the training data set was spatially partitioned into three disjoint data sets used for distributed learning, each with 2,187 examples (figure 8). The obtained spatial data sets stemmed from similar homogeneous distributions and had five continuous attributes and three equal-size classes. The other four data collections were the Covertype, Pen-based digits, Waveform and LED data sets from the UCI repository [2]. The Covertype data set, currently one of the largest databases in the UCI Database Repository, contains 581,012 examples with 54 attributes and 7 target classes representing the forest cover type for 30 × 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System [1]. In the Covertype data set, 40 attributes are binary columns representing soil type, 4 attributes are binary columns representing wilderness area, and the remaining 10 are continuous topographical attributes. Since the training of a neural network classifier would be very slow if using all 40 attributes representing the soil type variable, we transformed them into 7 new ordered attributes. These 7 attributes were determined by computing the relative frequencies of each of the 7 classes in each of the 40 soil types. Therefore, we used a 7-dimensional vector with values that could be considered continuous and therefore more appropriate for use with neural networks. This resulted in a transformed data set with 21 attributes. The 149,982 data instances were used for training in parallel boosting, while the same instances separated into 8 disjoint data sets were used for distributed learning. The remaining 431,032 data examples were used for out-of-sample testing. For the Pen-based digit data set, containing 16 attributes and 10 classes, the original training data set with 7,494 instances was randomly split into 6 disjoint subsets used for learning, each with 1,249 examples, while the data set with 3,498 instances was used for out-of-sample testing. For the Waveform set, 50,000 instances with 21 continuous attributes and three equally sized classes were generated. The generated data were split into 5 sets of 10,000 examples each, where 4 of them were merged and used as the training set for parallel boosting, while the same 4 kept separate were used for distributed learning. The fifth data set was used as a test set.
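The described soil-type transformation could be reproduced roughly as below; the column layout and function name are assumptions, and the class frequencies would be estimated on the training portion only.

    import numpy as np

    def encode_soil_type(soil_onehot, y, n_classes=7):
        # Replace the one-hot soil-type columns with a 7-dimensional vector of
        # relative class frequencies computed for each soil type (classes 0..6 assumed).
        soil_id = soil_onehot.argmax(axis=1)                 # which of the 40 soil types
        n_types = soil_onehot.shape[1]
        freq = np.zeros((n_types, n_classes))
        for s in range(n_types):
            ys = y[soil_id == s]
            if len(ys):
                freq[s] = np.bincount(ys, minlength=n_classes) / len(ys)
        return freq[soil_id]                                 # 7 new continuous attributes per example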

Figure 8. Partitioning the spatial training data set into three disjoint subsets.


The LED data set was generated with 10,000 examples and 10 classes, where 4 sets with 1,500 examples each were used for training in the parallel and distributed environments, and a set with 4,000 examples was used for testing.

For all proposed algorithms applied to all five data sets, the reported classification accuracies were obtained by averaging over 10 trials of the boosting algorithm. When applying boosting techniques to neural network classifiers, the best prediction accuracies were achieved using the Levenberg-Marquardt learning algorithm.

3.1. Results for parallel boosting

When experimenting with the parallel boosting algorithm, neural network classifiers were constructed on two, three and four processors during boosting iterations, since our experimental results indicated no significant differences in prediction accuracy when more neural networks were involved. In order to examine how the performance of parallel boosting depends on the size of the data used for learning, the size of the training data set was varied. The results for the synthetic spatial data set and for the Covertype data set are presented in figures 9(a) and (b), respectively, while the summary of the results for all five data sets is reported in Table 1.

It is evident from the charts obtained for the synthetic spatial data sets (figure 9) that the parallel boosting method achieved slightly better prediction accuracy than the standard boosting method in fewer boosting rounds.

Figure 9. Out of sample averaged classification accuracies for standard boosting applied on a single processor and for parallel boosting applied on 2, 3 and 4 processors. Both algorithms are applied on (a) synthetic spatial data set; (b) Covertype data set.


Table 1. A comparative analysis of parallel boosting speedup for different numbers of processors.

Number of parallel boosting iterations needed for achieving the same maximal accuracy as standard boosting trained on a single processor:

    Number of processors   Synthetic spatial   Covertype   Pen-based digits   LED   Waveform
    1                      100                 20          12                 10    8
    2                      28                  11          6                  7     6
    3                      14                  9           5                  6     5
    4                      9                   8           4                  5     5

The reduction in the number of boosting rounds required for achieving the maximal prediction accuracy was especially apparent when learning on larger data sets (figure 9(a1) and (b1)). However, when the size of the entire training data set decreased, parallel boosting became less superior to the standard boosting method (figure 9(a2), (a3), (b2) and (b3)). This phenomenon was probably due to the overfitting of the neural networks constructed at boosting iterations, since the composite classifier computed through competing neural network classifiers was probably overfitted more than a single neural network classifier. Although overfitting may sometimes be useful when combining classifiers [27] (it increases the variance of the combined models and therefore their diversity too), when small training data sets are available, the diversity of the classifiers could not be properly emphasized because the classifiers were constructed on samples drawn from the same training set with an insufficient number of points for achieving reasonable generalizability. Therefore, when the time for constructing a classifier ensemble is an issue, the speed of achieving the maximal prediction accuracy may be especially valuable when tightly coupled computer systems with a few processors are available.

Since in the proposed parallel boosting there is only one accessed data set, scaleup properties were not considered; instead, we determined the speedup, i.e. the decrease in the number of iterations that parallel boosting needed for achieving the same accuracy compared to standard boosting applied on a single processor. However, parallel boosting does not provide speedup directly but indirectly, since each processor samples from the same “global” data set and several classifiers are computed per iteration instead of one. The results shown in Table 1 illustrate that parallel boosting has very good speedup for the synthetic spatial data, good speedup for the Covertype and Pen-based digit data sets, but fairly poor speedup for the LED and Waveform data sets, probably due to their homogeneous distributions.

3.2. Results for distributed boosting in homogeneous environments

3.2.1. Time complexity analysis. We performed experiments on all five reported data sets and compared the computational time needed for training neural network (NN) classifiers using the standard and distributed boosting approaches. The major advantage of the proposed distributed boosting algorithm is that it requires significantly less computational time per boosting round, since the classifiers are learned on smaller data sets.

Figure 10. The time needed for learning NN classifiers for different sizes of five different data sets. (a) Synthetic spatial data set; (b) LED data set; (c) Waveform data set; (d) Covertype data set; (e) Pen-based digit data set.

Figure 10 shows how the time required for constructing NNs depends on the number of examples in the training set for all five reported data sets, measured on a Pentium III processor with 768 MB of main memory. Analyzing figure 10(a), it is evident that constructing a NN classifier on the three times reduced synthetic spatial training set resulted in more than three times faster computing time. Similarly, when constructing NNs on the LED and Waveform data sets, a four times smaller data set led to more than four times faster learning (figure 10(b) and (c)). For the Covertype data set, the time needed for training a NN on an eight times smaller data set was more than eight times smaller than the time required for training a NN on the entire training set (figure 10(d)). Finally, training a NN on a six times reduced Pen-based digit data set (figure 10(e)) resulted in 5.5 times faster training.

In order to estimate the speedup of the proposed distributed boosting algorithm, we need to consider the communication overhead, which involves the time required for broadcasting the NN classifiers and the sums Vj,t of the weight vectors wj,t to all sites. The size of the NN classifiers is directly proportional to the number of input, hidden and output nodes, and is relatively small in practice (e.g., our implementation of a two-layered feedforward NN with 5 input and 5 hidden nodes required only a few KB of memory). The broadcasting of such small classifiers results in a very small communication overhead, and when the number of distributed sites grows, the time needed for broadcasting increases linearly. However, the true estimate of the communication overhead among the distributed sites depends on the actual implementation of the communication amongst them. Assuming that the communication overhead for a small number of distributed sites is negligible compared to the time needed for training a NN classifier, the proposed distributed boosting algorithm achieves a linear speedup (figure 11). The scaleup is usually measured when increasing the number of sites while keeping the number of data examples per site constant. It is obvious that in such a situation the time needed for training NN classifiers on distributed sites is always the same regardless of the number of sites. The only variable component is the communication overhead, which is negligible for a small number of sites (up to 10). Therefore it is apparent that the achieved scaleup is also close to linear.

Figure 11. The speedup of the distributed boosting algorithm for different data sets.


3.2.2. Prediction accuracy comparison. To explore whether our distributed boosting algorithm can reach similar prediction accuracy as the standard boosting algorithm on a centralized data set, experiments were first performed using the simple majority and weighted majority algorithms for learning from homogeneous databases (figure 12). In addition to the comparison to standard boosting on centralized data, we also compared the distributed boosting algorithms to simpler algorithms for distributed learning. The first one was “distributed bagging”, where we used voting over classifier ensembles constructed independently on distributed sites using the bagging procedure, while the second algorithm employed voting over classifier ensembles built separately on distributed sites using the boosting method. Since the ensembles are constructed independently on each of the distributed sites, the only communication is the exchange of the classifier ensembles built on each site at the end of the procedure.

Figure 12. Out of sample classification accuracies of different distributed boosting algorithms. (a) Synthetic spatial data set; (b) Covertype data set; (c) LED data set; (d) Waveform data set; (e) Pen-based digit data set.


However, voting over ensembles built using the boosting method was consistently more accurate than “distributed bagging”, so for simplicity only those results are reported.

For each of the graphs shown in figure 12, the results were achieved for p = 0, as in most cases this modification was more accurate than using p = 1 or p = 2 for dividing the weight vector wj,t by the factor acc^p. Similar to the experiments with parallel boosting, the size of the data sets on the distributed sites was varied to investigate how the performance of the distributed boosting algorithm changes with the number of data points used for learning (figure 12).

Results from the experiments performed on the synthetic spatial and Covertype data sets indicate that the methods of simple and weighted majority voting of the classifiers constructed on multiple sites were successful when learning from distributed sites, since they achieved approximately the same classification accuracies as the standard boosting algorithm on merged data (figure 12(a) and (b)). It is also noticeable that a larger number of boosting iterations was needed to achieve the maximal prediction accuracy for smaller data sets than for larger ones. Figure 12(a) and (b) also demonstrate that the simple and weighted majority algorithms were more successful than voting over boosting ensembles built independently on distributed sites. Finally, all distributed algorithms were more accurate than the boosting method applied to a single distributed site (figure 12(a) and (b)), indicating that the data distributions on the distributed sites were different enough that learning from a single site could not achieve the same generalizability as learning from a centralized data set.

In the experiments on the LED, Waveform and Pen-based digit data sets (figure 12(c), (d) and (e)), the simple and weighted majority distributed boosting algorithms were consistently comparable in prediction accuracy to standard boosting on the centralized data. However, these majority algorithms also showed prediction accuracy similar to voting over classifier ensembles built using the boosting method, and were only slightly more accurate than the boosting method applied to a single distributed site, probably due to the high homogeneity of the data.

In addition, the effect of dividing the sampling weights wj,t by the factor acc^p (p = 0, 1, 2) was investigated for all three proposed distributed boosting methods. In general, in the presence of sites that are significantly more difficult for learning than the others, a small increase in the sampling weights wj,t resulted in achieving the maximal prediction accuracy in fewer boosting rounds. However, a larger acc^p factor (p = 2) could cause drawing insufficiently large samples from the sites that were easy to learn in later boosting iterations. As a consequence, the factor acc^2 (p = 2) could possibly result in method instability and a drop in prediction accuracy. To alleviate this problem, the minimum fraction of the original data set that needs to be sampled from a site in order to cause only a small, prespecified accuracy drop was determined by empirical evaluation to be 15% of the original data set size. Otherwise, the best classifier built so far on the particular site was used when making a classifier ensemble. In our experiments on the synthetic spatial data (figure 13(a)), increasing the weights wj,t usually resulted in deteriorating classification accuracy and in instability of the proposed method for smaller data sets (figure 13(a3)), while preserving the maximal prediction accuracy in experiments with large data sets (figure 13(a1)).


Figure 13. Out of sample averaged classification accuracies of the simple majority boosting algorithm with different weight modifications for distributed learning over (a) three synthetic spatial sites; (b) four sites containing LED data sets.

The experiments performed on the Covertype data set showed similar behavior to the experiments on the synthetic data set, while the experiments on the LED (figure 13(b)), Waveform and Pen-based digit data sets showed similar prediction accuracy for all explored factors for updating the sampling weights wj,t (Table 2). This was probably due to the homogeneous distributions in these data sets, where there were no extremely difficult examples that needed to be emphasized.

Finally, for distributed boosting in a homogeneous environment, we also performed experiments using the confidence-based method of combining classifiers with all three modifications for dividing the weights wj,t by the factor acc^p (p = 0, 1, 2) (figure 14).

Table 2. Final classification accuracies (%) for different distributed algorithms applied on five different data collections when dividing the weights wj,t by the factor acc^p.

    Method                         p    Spatial   Pen-based digit   LED    Waveform   Covertype
    Simple majority                0    82.7      96.5              73.4   87.0       72.6
    Simple majority                1    82.6      96.3              73.3   86.9       72.7
    Simple majority                2    82.2      96.1              73.1   86.8       72.5
    Confidence-based weighting     0    84.3      97.1              73.4   87.2       73.1
    Confidence-based weighting     1    82.9      96.5              73.6   87.1       73.2
    Confidence-based weighting     2    82.1      96.1              73.4   87.1       73.0


Figure 14. Out of sample averaged classification accuracies of confidence-based weighted combining of classifiers in distributed boosting over (a) three synthetic spatial sites; (b) four sites containing LED data sets.

The graphs in figure 14 show that confidence-based combining of classifiers achieved accuracy similar to the standard boosting applied on centralized data and to the other methods considered for distributed boosting. In addition, for some values of the parameter p, the confidence-based distributed boosting slightly outperformed all other boosting methods. The experiments performed on the Covertype data set using the confidence-based distributed boosting showed similar effects as the experiments on the synthetic spatial data sets, while the experiments performed on the Waveform and Pen-based digit data sets demonstrated similar behavior to the experiments carried out on the LED data sets; therefore, these results are not reported here. Unlike parallel boosting, the improvement in prediction accuracy was more significant when learning from smaller data sets, but instability was also more evident for smaller data sets (figure 14(a3) and (b3)). The increase in prediction accuracy with decreasing data set size was probably due to the fact that the data sets on the multiple sites were homogeneous, and more data points were needed in order to improve the generalizability of our models. When the number of data instances decreased, there were not enough examples to learn the data distribution on a single site, but the variety of data instances from multiple sites still helped in achieving diversity of the built classifiers. In the parallel boosting experiments (Section 3.1), the classifiers were learned over the same training set, and therefore this diversity was not apparent.

Due to the homogeneous distributions, the experiments performed on the LED, Waveform and Covertype data sets again demonstrated only a small observable difference in accuracy between the standard boosting and all variants of the confidence-based distributed boosting algorithm when p = 0, 1 and 2 (Table 2).

3.3. Results for distributed boosting specialized experts in heterogeneous environments

For distributed boosting with heterogeneous databases, due to difficulties in finding appropriate real-life heterogeneous data sets, only synthetic spatial data sets generated with the spatial data simulator [20] were used, where we were able to perform controlled experiments on life-like heterogeneous data of various complexity. In our experiments for distributed learning explained in Section 2.3.1, all distributed sites had similar heterogeneous distributions with the same sets of attributes. Four independently generated synthetic spatial data sets were used, each corresponding to five homogeneous data distributions made using our spatial data simulator. The attributes f4 and f5 were simulated to form five distributions in their attribute space (f4, f5) using the technique of feature agglomeration [20]. Furthermore, instead of using a single model for generating the target attribute on the entire spatial data set, a different data generation process using different relevant attributes was applied for each distribution. The degree of relevance was also different for each distribution. All four data sets had 6,561 patterns with five relevant (f1, . . . , f5) and five irrelevant attributes (f6, . . . , f10), where three data sets were used as the data from the distributed sites, and the fourth set was used as the test data set. The experiments for distributed boosting involving different homogeneous databases (Section 2.3.2) were performed on six synthetic spatial data sets, each with a different distribution. Five of them were used for learning, and the sixth was a test set.

For distributed boosting specialized experts (Section 2.3.1), experiments were performed when all data sites had a similar but heterogeneous distribution. In addition to the comparison to standard boosting and boosting specialized experts on centralized data, this distributed algorithm was also compared to the mixture of experts [13] method, adapted for distributed learning, where voting over mixtures of experts from distributed sites is used to classify new instances.

Figure 15 shows that both methods of boosting specialized experts in centralized and distributed environments resulted in improved generalization (approximately 76–77% accuracy as compared to 72–73% obtained through standard and distributed boosting). Furthermore, both methods of boosting specialized experts outperformed the “distributed” version of the mixture of experts, which achieved 75.2% ± 0.5% classification accuracy, and also the mixture of experts method applied on centralized data (75.4% ± 0.4% classification accuracy). When comparing to standard and distributed boosting, it was also evident that the methods of boosting specialized experts required significantly fewer iterations in order to reach the maximal prediction accuracy. After the prediction accuracy was maximized in our experiments, the overall prediction accuracy on a validation set, as well as the total classification accuracy on the test set, started to decline. The data set from one of the distributed sites used for learning served as a validation set.


Figure 15. Out of sample averaged classification accuracies of distributed learning over three heterogeneous sites for the synthetic spatial test data set (6,561 instances) with 3 equal-size classes, stemming from a similar mixture of 5 homogeneous distributions.

The deterioration of the classification accuracy was probably due to the fact that in the later iterations only data points that were difficult for learning were drawn, and therefore the size of some of the clusters identified during those iterations started to decrease, causing a deficiency in the number of drawn data examples needed for successful learning. As a consequence, the prediction accuracy on these clusters began to decline and hence the total prediction accuracy decreased, too. It is interesting to observe that this accuracy drop is not noticed in standard boosting, since the same number of examples is drawn in each iteration and only a single classifier is constructed on these examples, thus avoiding an insufficient number of examples for learning.

A criterion for stopping the boosting algorithm early was to stop the procedure when the classification accuracy on the validation set started to decline. However, after approximately 20 additional boosting iterations the prediction accuracy stabilized (figure 15). Although in practice the prediction accuracy on the test set does not necessarily start to drop in the same iteration as for the validation set, in our experiments this difference was usually within two to three boosting iterations and did not significantly affect the total generalizability of the proposed method. However, a thorough inspection of this phenomenon is beyond the scope of this paper and requires experiments on more data sets.

Two groups of experiments were performed when learning from sites with different homogeneous distributions and with the same set of attributes (Section 2.3.2). In the first group of experiments, five data sets with different distributions were used for learning, and a data set with a distribution similar to the distribution from one of the existing multiple sites was used for testing (figure 16). The second group of experiments was related to learning from the same five data sets, but testing on the same data set with five homogeneous distributions as in the experiments where the sites had similar heterogeneous distributions (figure 17).

The experimental results for distributed boosting in a heterogeneous environment demonstrated that the method relying on computing the Mahalanobis distance among the sites outperformed both the standard and the alternative distributed boosting methods (figure 17). This result was a consequence of the high heterogeneity of the synthetic data sets. In such cases, the classifiers constructed on sites with a distribution very different from the distribution of the test data set only decreased the accuracy of the classifier ensembles.


Figure 16. Testing distributed boosting methods on data stemming from a distribution similar to one of the distributions from 5 learning sites.

Figure 17. Testing distributed boosting methods on data stemming from a mixture of distributions from 5 learning sites.

4. Conclusion

A framework for parallel and distributed boosting is proposed. It is intended to efficiently learn classifiers over large, distributed and possibly heterogeneous databases that cannot fit into the computer's main memory. Experimental results on several data sets indicate that the proposed boosting techniques can effectively achieve the same or an even slightly better level of prediction accuracy than standard boosting applied to centralized data, while the cost of learning and the memory requirements are considerably lower.

This paper raised several interesting questions that have recently gained a lot of attention. First, successful learning from very large and potentially distributed databases imposes major performance challenges for data mining, since learning a monolithic classifier can be prohibitively slow due to the requirement that all the data be held in main memory. Second, many distributed data sets cannot be merged together due to a variety of practical constraints, including data dispersed over many geographic locations, security concerns and competitive interests. Third, the prediction accuracy of the employed data mining algorithms is of fundamental importance for their successful application. Finally, the computational time required for constructing a prediction model is becoming more important as the amount of available data constantly grows.

The proposed boosting algorithms successfully address these concerns under a variety of conditions, thus offering a fairly general method for effective and efficient learning in parallel and distributed environments.

A possible drawback of the proposed methods is that a large number of classifiers and their ensembles are constructed from the available data sets. In such situations, methods for post-pruning the classifiers [16] may be necessary to increase system throughput while still maintaining the achieved prediction accuracy.

Although the performed experiments have provided evidence that the proposed methods can be successful for parallel and distributed learning, future work is needed to fully characterize them, especially in distributed environments with heterogeneous databases, where new algorithms for selectively combining classifiers from multiple sites with different distributions are worth considering. It would also be interesting to examine the influence of the number of distributed sites and their sizes on the achieved prediction accuracy and to establish a satisfactory trade-off.

Finally, the proposed methods can be adapted for on-line learning, when new data become available periodically and it is computationally expensive to rebuild a single classifier or an ensemble on the entire data set.

Acknowledgments

The authors are grateful to Dragoljub Pokrajac for providing simulated data and for his help in implementation design, and to Celeste Brown for her useful comments.

References

1. J. Blackard, “Comparison of neural networks and discriminant analysis in predicting forest cover types,” Ph.D. dissertation, Colorado State University, 1998.
2. C.L. Blake and C.J. Merz, “UCI repository of machine learning databases,” http://www.ics.uci.edu/∼mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science, 1998.
3. L. Breiman and N. Shang, “Born again trees,” ftp://ftp.stat.berkeley.edu/pub/users/breiman/BAtrees.ps, 1996.
4. P. Chan and S. Stolfo, “On the accuracy of meta-learning for scalable data mining,” Journal of Intelligent Integration of Information, L. Kerschberg (Ed.), 1998.
5. S. Clearwater, T. Cheng, H. Hirsh, and B. Buchanan, “Incremental batch learning,” in Proc. of the Sixth Int. Machine Learning Workshop, Ithaca, NY, 1989, pp. 366–370.
6. J. Dy and C. Brodley, “Feature subset selection and order identification for unsupervised learning,” in Proc. of the Seventeenth Int. Conf. on Machine Learning, Stanford, CA, 2000, pp. 247–254.
7. W. Fan, S. Stolfo, and J. Zhang, “The application of AdaBoost for distributed, scalable and on-line learning,” in Proc. of the Fifth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Diego, CA, 1999, pp. 362–366.
8. B. Flury, A First Course in Multivariate Statistics, Springer-Verlag: New York, NY, 1997.
9. Y. Freund and R.E. Schapire, “Experiments with a new boosting algorithm,” in Proc. of the Thirteenth Int. Conf. on Machine Learning, San Francisco, CA, 1996, pp. 325–332.
10. J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” The Annals of Statistics, vol. 38, no. 2, pp. 337–374, 2000.
11. M. Hagan and M.B. Menhaj, “Training feedforward networks with the Marquardt algorithm,” IEEE Trans. on Neural Networks, vol. 5, pp. 989–993, 1994.
12. S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall: Englewood Cliffs, NJ, 1999.
13. M. Jordan and R. Jacobs, “Hierarchical mixture of experts and the EM algorithm,” Neural Computation, vol. 6, no. 2, pp. 181–214, 1994.
14. A. Lazarevic, T. Fiez, and Z. Obradovic, “Adaptive boosting for spatial functions with unstable driving attributes,” in Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Kyoto, Japan, 2000, pp. 329–340.
15. A. Lazarevic and Z. Obradovic, “Boosting localized classifiers in heterogeneous databases,” in Proc. of the First Int. SIAM Conf. on Data Mining, Chicago, IL, 2001.
16. A. Lazarevic and Z. Obradovic, “The effective pruning of neural network ensembles,” in Proc. of IEEE Int. Joint Conf. on Neural Networks, Washington, D.C., 2001, pp. 796–801.
17. A. Lazarevic, D. Pokrajac, and Z. Obradovic, “Distributed clustering and local regression for knowledge discovery in multiple spatial databases,” in Proc. 8th European Symp. on Artificial Neural Networks, Bruges, Belgium, 2000, pp. 129–134.
18. A. Lazarevic, X. Xu, T. Fiez, and Z. Obradovic, “Clustering-Regression-Ordering steps for knowledge discovery in spatial databases,” in Proc. IEEE/INNS Int. Conf. on Neural Networks, Washington, D.C., No. 345, Session 8.1B, 1999.
19. L. Mason, J. Baxter, P. Bartlett, and M. Frean, “Function gradient techniques for combining hypotheses,” in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Scholkopf, and D. Schuurmans (Eds.), MIT Press: Cambridge, MA, 2000, chap. 12.
20. D. Pokrajac, T. Fiez, and Z. Obradovic, “A data generator for evaluating spatial issues in precision agriculture,” Precision Agriculture, in press.
21. D. Pokrajac, A. Lazarevic, V. Megalooikonomou, and Z. Obradovic, “Classification of brain image data using measures of distributional distance,” in Proc. 7th Annual Meeting of the Organization for Human Brain Mapping, London, UK, 2001.
22. A. Prodromidis, P. Chan, and S. Stolfo, “Meta-learning in distributed data mining systems: Issues and approaches,” in Advances in Distributed Data Mining, H. Kargupta and P. Chan (Eds.), AAAI Press: Menlo Park, CA, 2000.
23. F. Provost and D. Hennesy, “Scaling up: Distributed machine learning with cooperation,” in Proc. of the Thirteenth National Conf. on Artificial Intelligence, Portland, OR, 1996, pp. 74–79.
24. M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in Proc. of the IEEE Int. Conf. on Neural Networks, San Francisco, CA, 1993, pp. 586–591.
25. J. Sander, M. Ester, H.-P. Kriegel, and X. Xu, “Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 169–194, 1998.
26. J. Shafer, R. Agrawal, and M. Mehta, “SPRINT: A scalable parallel classifier for data mining,” in Proc. of the 22nd Int. Conf. on Very Large Data Bases, Mumbai (Bombay), India, 1996, pp. 544–555.
27. P. Sollich and A. Krogh, “Learning with ensembles: How over-fitting can be useful,” Advances in Neural Information Processing Systems, vol. 8, pp. 190–196, 1996.
28. M. Sreenivas, K. AlSabti, and S. Ranka, “Parallel out-of-core decision tree classifiers,” in Advances in Distributed Data Mining, H. Kargupta and P. Chan (Eds.), AAAI Press: Menlo Park, CA, 2000.
29. A. Srivastava, E. Han, V. Kumar, and V. Singh, “Parallel formulations of decision-tree classification algorithms,” Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 237–261, 1999.
30. P. Utgoff, “An improved algorithm for incremental induction of decision trees,” in Proc. of the Eleventh Int. Conf. on Machine Learning, New Brunswick, NJ, 1994, pp. 318–325.
31. M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithms for discovery of association rules,” Data Mining and Knowledge Discovery: An International Journal, special issue on Scalable High-Performance Computing, vol. 1, no. 4, pp. 343–373, 1997.


DISTRIBUTED AND PARALLEL DATABASES

An International Journal

INSTRUCTIONS FOR AUTHORS

Authors are encouraged to submit high quality, original work that has neither appeared in, nor is under consideration by, other journals.

PROCESS FOR SUBMISSION

1. Authors should submit five hard copies of their final manuscript to:

Melissa Andersen
DISTRIBUTED AND PARALLEL DATABASES
Editorial Office
Kluwer Academic Publishers
101 Philip Drive
Norwell, MA 02061
Tel.: 781-871-6600
FAX: 781-871-6528
E-mail: [email protected]

For prompt attention, all correspondence can be directed to this address.

2. Authors are strongly encouraged to use Kluwer's LaTeX journal style file. Please see the ELECTRONIC FORM section below.

3. Enclose with each manuscript, on a separate page, from three to five keywords.

4. Enclose originals for the illustrations (see "Style for Illustrations") for one copy of the manuscript. Photocopies of the figures may accompany the remaining copies of the manuscript. Alternatively, original illustrations may be submitted after the paper has been accepted.

5. Enclose a separate page giving the preferred address of the contact author for correspondence and return of proofs. Please include a telephone number, fax number and e-mail address, if available.

6. The refereeing is done by anonymous reviewers.

Electronic Delivery

Please send only the electronic version (of the ACCEPTED paper) via one of the methods listed below. Note, in the event of minor discrepancies between the electronic version and hard copy, the electronic file will be used as the final version.

Via electronic mail

1. Please e-mail the electronic version to

[email protected]

2. Recommended formats for sending files via e-mail:

a. Binary files: uuencode or binhex
b. Compressing files: compress, pkzip, or gzip
c. Collecting files: tar

3. The e-mail message should include the author's last name, the name of the journal to which the paper has been accepted, and the type of file (e.g., LaTeX or ASCII).
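For authors who prefer to prepare the archive programmatically, the following is a minimal sketch, not part of Kluwer's instructions, of how the source files could be collected with tar and compressed with gzip (items 2b and 2c above) using Python's standard library; the directory name and archive name are illustrative assumptions.

    # Minimal sketch (assumed names): bundle manuscript sources into a gzip-compressed tar file.
    import tarfile

    def package_manuscript(source_dir="manuscript", archive="lastname-dapd.tar.gz"):
        # "w:gz" writes the tar archive and gzip-compresses it in one step.
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(source_dir)   # adds the directory and its contents recursively
        return archive

    if __name__ == "__main__":
        print("Created", package_manuscript())

The resulting single .tar.gz file can then be attached to the e-mail message described in item 3.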


Via anonymous FTP

ftp: ftp.wkap.com
cd: /incoming/production

Send e-mail to [email protected] to inform Kluwer that the electronic version is at this FTP site.
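Purely as an illustration (the journal describes only the manual FTP session above), an anonymous upload to the directory given here could be scripted with Python's standard ftplib; the archive filename is a placeholder.

    # Sketch under stated assumptions: anonymous FTP upload of the accepted manuscript archive.
    from ftplib import FTP

    def upload(archive="lastname-dapd.tar.gz"):
        with FTP("ftp.wkap.com") as ftp:          # host taken from the instructions above
            ftp.login()                            # no credentials supplied = anonymous login
            ftp.cwd("/incoming/production")        # directory taken from the instructions above
            with open(archive, "rb") as fh:
                ftp.storbinary("STOR " + archive, fh)

    if __name__ == "__main__":
        upload()

After the transfer, remember to notify [email protected] as instructed above.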

Via disk

1. Label a 3.5 inch floppy disk with the operating system and word processing program along with the authors' names, manuscript title, and name of journal to which the paper has been accepted.

2. Mail disk to

Kluwer Academic Publishers
Desktop Department
101 Philip Drive
Assinippi Park
Norwell, MA 02061

For any questions about the above procedures, please send e-mail to:

[email protected]

We hope that these electronic procedures will encourage the submission of manuscripts to this journal as well as improve the publication schedule.

STYLE FOR MANUSCRIPT

1. Typeset, double or 1 1/2 space; use one side of sheet only (laser printed, typewritten and good quality duplication acceptable). Our LaTeX style file offers a draft mode for this purpose.

2. Use an informative title for the paper and include an abstract of 100 to 250 words at the head of the manuscript. The abstract should be a carefully worded description of the problem addressed, the key ideas introduced, and the results. Abstracts will be printed with the article.

3. Provide a separate double-spaced sheet listing all footnotes, beginning with "Affiliation of author" and continuing with numbered footnotes. Acknowledgment of financial support may be given if appropriate.

4. References should appear in a separate bibliography at the end of the paper, double spaced, with items referred to by numerals and put in alphabetical order. References in the text should be denoted by numbers in square brackets, e.g. [12]. References should be complete, in the following style:

Style for papers: Author(s) initials followed by last name for each author, paper title, publication name, volume, inclusive page numbers, month and year.

Style for books: Author(s), title, publisher, location, year, chapter or page numbers (if desired).

Examples as follows:

(Book) D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, Freeman: San Francisco, CA, 1982.

(Journal Article) A. Rosenfeld and M. Thurston, "Edge and curve detection for visual scene analysis," IEEE Trans. Comput., vol. C-20, pp. 562–569, 1971.

(Conference Proceedings) A. Witkin, "Scale-space filtering," in Proc. Int. Joint Conf. Artif. Intell., Karlsruhe, West Germany, 1983, pp. 1019–1021.


(Lab. memo.) A.L. Yuille and T. Poggio, "Scaling theorems for zero crossings," M.I.T. Artif. Intell. Lab., Massachusetts Inst. Technol., Cambridge, MA, A.I. Memo. 722, 1983.

5. Type or mark mathematical expressions exactly as they should appear in print. Journal style for letter symbols is as follows: variables, italic type (indicated by underline); constants, roman text type; matrices and vectors, boldface type (indicated by wavy underline). In word-processor manuscripts, use appropriate typeface. It will be assumed that letters in displayed equations are to be set in italic type unless you mark them otherwise. All letter symbols in text discussion must be marked if they should be italic or boldface. Indicate best breaks for equations in case they will not fit on one line.
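As a brief illustration of these typeface conventions (an example added here for clarity, not part of the journal's instructions), a displayed equation prepared with the LaTeX style file might set matrices and vectors in boldface while scalar variables remain in the default math italic:

    % matrices A and vectors x, y, u boldface; the scalar variable c stays in math italic
    \[
      \mathbf{y} = \mathbf{A}\mathbf{x} + c\,\mathbf{u}
    \]

Numeric constants are left in upright (roman) type, as item 5 requires.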

STYLE FOR ILLUSTRATIONS

1. Originals for illustrations should be sharp, noise-free, and of good contrast. We regret that we cannot provide drafting or art service.

2. Line drawings should be in laser printer output or in India ink on paper or board. Use 8 1/2 × 11-inch (22 × 29 cm) size sheets if possible, to simplify handling of the manuscript.

3. Each figure should be mentioned in the text and numbered consecutively using Arabic numerals. Specify the desired location of each figure in the text, but place the figure itself on a separate page following the text.

4. Number each table consecutively using Arabic numerals. Please label any material that can be typeset as a table, reserving the term "figure" for material that has been drawn. Specify the desired location of each table in the text, but place the table itself on a separate page following the text. Type a brief title above each table.

5. All lettering should be large enough to permit legible reduction.

6. Photographs should be glossy prints, of good contrast and gradation, and any reasonable size.

7. Number each original on the back.

8. Provide a separate sheet listing all figure captions, in proper style for the typesetter, e.g., "Fig. 3. Examples of the fault coverage of random vectors in (a) combinational and (b) sequential circuits."

PROOFING

Page proofs for articles to be included in a journal issue will be sent to the contact author for proofing, unless otherwise informed. The proofread copy should be received back by the Publisher within 72 hours.

COPYRIGHT

It is the policy of Kluwer Academic Publishers to own the copyright of all contributions it publishes. To comply with the U.S. Copyright Law, authors are required to sign a copyright transfer form before publication. This form returns to authors and their employers full rights to reuse their material for their own purposes. Authors must submit a signed copy of this form with their manuscript.

REPRINTS

Each group of authors will be entitled to 25 free reprints of their paper.
