
Page 1

On the Parallel Complexity of Hierarchical Clustering and CC-Complete Problems

Dr. Raymond Greenlaw
School of Computing, Armstrong Atlantic State University

Dr. Sanpawat Kantabutra
Department of Computer Science, Chiang Mai University

Page 2

Outline
• Introduction
• Preliminaries
• Algorithms for Hierarchical Clustering
• Complexity of Hierarchical Clustering
• CC-Complete Problems
• Conclusions and Open Problems
• References
• Acknowledgments


Page 4

Introduction
• Clustering is the division of data into groups of 'similar' objects, where each group is given a more compact representation.
• It is used to model very large data sets.
• Points are more similar to points in their own cluster than to points in other clusters.

Page 5

Introduction
• Clustering is a useful tool in data mining, which involves immense data sets that are difficult to store and manipulate.
• Study the parallel complexity of the hierarchical clustering problem.
• Hierarchical clustering builds a tree of clusters.
• Sibling clusters in this tree partition the points associated with their parent.
• The tree lets one explore the data at various levels of granularity.

Page 6

Introduction
• Two widely studied models:
  – Bottom-Up: starts with single-point clusters and then recursively merges two or more of the most 'appropriate' clusters.
  – Top-Down: starts with one large cluster consisting of all the data points and then recursively splits the most 'appropriate' cluster.
• In both methods, the process continues until a desired stopping condition is met, such as a required number of clusters or a diameter bound on the 'largest' cluster.

Page 7

Introduction
• A variety of sequential hierarchical-clustering methods have been studied:
  – CURE [Guha, Rastogi, and Shim 1998]: Bottom-Up, good for clusters having arbitrary shapes or outliers
  – CHAMELEON [Karypis, Han, and Kumar 1999]: Bottom-Up, relies heavily on graph partitioning
  – Principal Direction Divisive Partitioning [Boley 1998]: Top-Down, good for document collections
  – Hierarchical Divisive Bisecting k-means [Steinbach, Karypis, and Kumar 2000]: Top-Down

Page 8

Introduction
• Address the parallel complexity of hierarchical clustering.
• Describe known sequential algorithms for top-down and bottom-up hierarchical clustering.
• Parallelize top-down: when n points are to be clustered, provide an O(log n)-time, n^2-processor CREW-PRAM algorithm that computes the same output as the corresponding sequential algorithm.

Page 9

Introduction
• Define a natural decision problem based on bottom-up hierarchical clustering, the Hierarchical Clustering Problem (HCP), and add it to the list of CC-complete problems; this adds a data-mining problem to the list for the first time.
• Show that the HCP is one of the computationally most difficult problems in CC, the class defined by the Comparator Circuit Value Problem (CCVP).

Page 10

Introduction
• Demonstrate that the HCP is very unlikely to have an NC algorithm.
• In sharp contrast, give an NC algorithm for the top-down sequential approach.
• The parallel complexities of top-down and bottom-up are therefore different, unless CC equals NC.


Page 12

Preliminaries
• We are interested in relating the complexity of hierarchical clustering to that of a problem involving Boolean circuits containing comparator gates.
• A comparator gate has two output wires: the first outputs the minimum (AND) and the second outputs the maximum (OR) of its two Boolean inputs.
• Each output has a maximum fanout of one.

Page 13

Preliminaries
• The comparator gate is the basis for an entire complexity class.

Comparator Circuit Value Problem (CCVP)
• Given: An encoding of a Boolean circuit composed of comparator gates, inputs x1,…,xn, and a designated output y.
• Problem: Is output y TRUE on input x1,…,xn? (A small evaluation sketch follows.)
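To make the gate semantics concrete, the following is a minimal evaluation sketch; it is our own illustration, not code from the paper, and the list-of-tuples circuit encoding, the wire names, and the function name are assumptions made for this example.

```python
def evaluate_comparator_circuit(inputs, gates, output_wire):
    """Evaluate a comparator circuit on Boolean inputs.

    inputs: dict mapping wire name -> bool (the circuit inputs x1, ..., xn).
    gates:  list of (a, b, lo, hi) tuples in topological order; each gate
            reads wires a and b, writes min(a, b) (i.e., AND) to wire lo
            and max(a, b) (i.e., OR) to wire hi.
    output_wire: the designated output y.
    """
    wires = dict(inputs)
    for a, b, lo, hi in gates:
        left, right = wires[a], wires[b]
        wires[lo] = left and right   # minimum of the two Boolean values
        wires[hi] = left or right    # maximum of the two Boolean values
    return wires[output_wire]

# Tiny example: a single gate comparing x1 and x2; ask for the 'max' wire.
print(evaluate_comparator_circuit({"x1": True, "x2": False},
                                  [("x1", "x2", "lo1", "hi1")], "hi1"))  # True
```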

Page 14

Preliminaries
• Let P denote the class of all languages decidable in polynomial time.
• Let NC denote the class of all languages decidable in poly-logarithmic time using a polynomial number of processors on a PRAM.
• Let RNC denote the randomized version of NC.
• Let NLOG denote the class of all languages decidable in non-deterministic logarithmic space.
• Let CC denote the class of problems that are NC many-one reducible to the CCVP.


Page 16

Algorithms for Hierarchical Clustering

Sequential Algorithms – Bottom-Up
• Input: a set of points, a distance function, a bound B, and the desired number of clusters k
• Output: a set of clusters
• Pair up all points, starting with the two closest ones, then the next remaining two closest ones, and so on, until all points are paired.
• Next, the sets of points X and Y minimizing dmin(X,Y) over all remaining sets are merged, until only k sets remain.

Page 17

Algorithms for Hierarchical Clustering

Sequential Algorithms – Bottom-Up (cont.)
• It is assumed that the number of input points is even.
• There are no restrictions placed on the distance function.
• In the first phase of the algorithm, points whose distance is less than or equal to B are paired. (A sketch of this phase follows.)
• The algorithm operates in polynomial time.
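The slides give the first (pairing) phase only at a high level, so the following Python sketch is one possible reading of it rather than the paper's pseudocode: repeatedly take the closest remaining pair whose distance is at most B and make it a two-point cluster. The function name, the tie-breaking rule, and the treatment of leftover points as singletons are our own assumptions.

```python
from itertools import combinations

def bottom_up_phase1(points, dist, B):
    """First (pairing) phase of the bottom-up algorithm, as we read the slides.

    points: orderable, hashable point identifiers (assumed even in number).
    dist:   function dist(p, q) -> number; no metric properties are assumed.
    B:      distance bound; only pairs with dist(p, q) <= B are paired.
    Returns a list of clusters (frozensets of one or two points).
    """
    remaining = set(points)
    clusters = []
    while True:
        # Closest still-unpaired pair whose distance is within the bound B.
        candidates = [(dist(p, q), p, q)
                      for p, q in combinations(sorted(remaining), 2)
                      if dist(p, q) <= B]
        if not candidates:
            break
        _, p, q = min(candidates)               # assumed tie-break: smaller ids
        clusters.append(frozenset({p, q}))
        remaining -= {p, q}
    clusters.extend(frozenset({p}) for p in remaining)  # leftovers stay single
    return clusters
```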

Page 18

Algorithms for Hierarchical Clustering

Sequential Algorithms – Top-Down
• The function v(G) takes a graph G as its argument and returns the set of vertices of G.
• Input: a set of points, a distance function, and the desired number of clusters k
• Output: a set of clusters
• All points start in the same cluster.

Page 19

Algorithms for Hierarchical Clustering

Sequential Algorithms – Top-Down (cont.)
• Compute a minimum-cost spanning tree of the graph corresponding to the initial set of points with respect to the distance function.
• Form clusters by repeatedly removing the highest-cost edge from what remains of the spanning tree, until exactly k sets have been formed (as sketched below).
• The algorithm runs in polynomial time.
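A minimal sketch of this top-down procedure, under the assumption that the graph is the complete graph on the points with the distance function as edge costs; it uses Kruskal's algorithm for the spanning tree and is our own illustration, not the paper's code.

```python
from itertools import combinations

def top_down_clusters(points, dist, k):
    """Top-down clustering via an MST: build a minimum-cost spanning tree of
    the complete graph on `points` (edge cost = dist), delete the k-1
    highest-cost tree edges, and return the remaining connected components."""
    # Kruskal's algorithm with a small union-find.
    parent = {p: p for p in points}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    tree = []
    for p, q in sorted(combinations(points, 2), key=lambda e: dist(*e)):
        rp, rq = find(p), find(q)
        if rp != rq:
            parent[rp] = rq
            tree.append((p, q))
    # Remove the k-1 most expensive tree edges, keeping the cheaper ones.
    tree.sort(key=lambda e: dist(*e))
    kept = tree[:max(0, len(tree) - (k - 1))]
    # Connected components of the remaining forest are the k clusters.
    comp = {p: p for p in points}
    def cfind(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x
    for p, q in kept:
        comp[cfind(p)] = cfind(q)
    clusters = {}
    for p in points:
        clusters.setdefault(cfind(p), set()).add(p)
    return list(clusters.values())
```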

Page 20

Algorithms for Hierarchical Clustering

• Top-Down and Bottom-Up have different parallel complexities, unless CC equals NC.

• Prove that the exact same clusters as produced by the Sequential (Top-Down) Hierarchical Clustering Algorithm can be computed in NC.

• A natural decision problem based on the Sequential (Bottom-Up) Hierarchical Clustering Algorithm is CC-complete.

Page 21

Algorithms for Hierarchical Clustering

• Since a CC-complete problem is very unlikely to have an NC algorithm, and a problem with an NC algorithm is very unlikely to be CC-complete, the parallel complexities of these two sequential algorithms are almost certainly different.
• To obtain a fast parallel algorithm for hierarchical clustering, the algorithm should therefore be based on the Top-Down approach.

Page 22

Algorithms for Hierarchical Clustering

• Theorem: Let n denote the number of points to be clustered. The Parallel (Top-Down) Hierarchical Clustering Algorithm can be implemented in O(log n) time using n^2 processors on the CREW PRAM.

• This algorithm is an NC algorithm, which means that the clusters can be computed very fast in parallel.

• Any reasonable decision problem based on this algorithm will be in NC.


Page 24

Complexity of Hierarchical Clustering

Hierarchical Clustering Problem (HCP)
• Given: A set S of n points in R^d, a distance function dS : S × S → ℕ, the number of clusters k ≤ n/2 with k ∈ ℕ, a distance bound B, and two points x, y ∈ S.
• Problem: Are x and y, with dS(x, y) ≤ B, in the same cluster C after the first phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm? (See the sketch below.)
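Using the illustrative bottom_up_phase1 sketch given earlier (our own code, not the authors'), an HCP instance can be answered by simulating that phase and checking membership; this only clarifies the decision question and is not the paper's reduction machinery.

```python
def hcp(points, dist, B, x, y):
    """Decide an HCP instance as stated above: after the first phase of the
    bottom-up algorithm, do x and y (with dist(x, y) <= B) share a cluster?"""
    assert dist(x, y) <= B, "HCP requires the two query points to be within B"
    for cluster in bottom_up_phase1(points, dist, B):  # sketch defined earlier
        if x in cluster and y in cluster:
            return True
    return False
```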

Page 25

Complexity of Hierarchical Clustering

• No restrictions are placed on the properties the distance function must satisfy, but the distances themselves must be natural numbers.
• This version of the problem easily reduces to the version where the weights come from R+.
• We are not concerned with the distance between a point and itself; k is the number of clusters to be formed.
• x and y are required to be no further apart than the distance bound B.

Page 26

Complexity of Hierarchical Clustering

Lexicographically First Maximal Matching Problem (LFMMP)

• Given: An undirected graph G = (V, E) with an ordering on its edges, plus a distinguished edge e ∈ E.
• Problem: Is e in the lexicographically first maximal matching of G?
• A matching is maximal if it cannot be extended. (The greedy construction of the lexicographically first maximal matching is sketched below.)
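The lexicographically first maximal matching is exactly what the obvious greedy scan over the edge ordering produces, so the LFMMP can be stated in a few lines of code; the sketch below is our own illustration, with the edge list assumed to be given already sorted by the ordering.

```python
def lfmm(edges):
    """Greedy scan in the given edge order yields the lexicographically
    first maximal matching.  `edges` is a list of (u, v) pairs, already
    sorted by the given ordering."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

def lfmmp(edges, e):
    """LFMMP: is the distinguished edge e in the lexicographically first
    maximal matching of the graph?"""
    u, v = e
    matching = lfmm(edges)
    return (u, v) in matching or (v, u) in matching
```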

Page 27

Complexity of Hierarchical Clustering

• The LFMMP is CC-complete [Cook 1982, Mayr and Subramanian 1992].
• Theorem: The Hierarchical Clustering Problem is NC many-one reducible to the Lexicographically First Maximal Matching Problem, that is, HCP ≤_m^NC LFMMP.
• Hence the HCP is in CC.

Page 28

Complexity of Hierarchical Clustering

• Theorem: The Lexicographically First Maximal Matching Problem is NC many-one reducible to the Hierarchical Clustering Problem, that is, LFMMP ≤_m^NC HCP.
• Proof Sketch: Let G = (V = {1,…,n}, E), an ordering φ : E → {1,…,|E|} on E, and a distinguished edge e = {u,v} ∈ E be an instance of the LFMMP. Construct an instance of the HCP: a set S of 2n points in R^d, a distance function dS : S × S → ℕ, a number of clusters k ∈ ℕ with k ≤ |S|/2, a bound B, and two points x, y ∈ S.

Page 29

Complexity of Hierarchical Clustering

• Proof (cont.): Let S = {1,…,n,n+1,…,2n} and let V' = S - V. Define the distance function between each pair of distinct points in S as follows:

  dS(x,y) = φ({x,y})   if {x,y} ∈ E
          = 2|E|       if x ∈ V and y ∈ V', or vice versa
          = 3|E|       if x ∈ V', y ∈ V', and x ≠ y
          = 4|E|       if x ≤ n, y ≤ n, x ≠ y, and {x,y} ∉ E

• Let B = |E|, k = n, and take u and v as our two query points. (An illustrative implementation of this construction follows.)
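As an illustration of this construction (our own code, not the authors'), the sketch below builds the distance function exactly as defined above, with vertices numbered 1,…,n, the edge list given in the order φ, and the distinguished edge e assumed to be in E; it then answers the LFMMP query through the hcp and bottom_up_phase1 sketches shown earlier.

```python
def lfmmp_to_hcp(n, edges, e):
    """Reduce an LFMMP instance to an HCP instance per the construction above.

    n:     number of vertices; V = {1, ..., n}.
    edges: list of edges (u, v) listed in the ordering phi (rank = index + 1).
    e:     the distinguished edge (u, v), assumed to appear in `edges`.
    """
    V = set(range(1, n + 1))
    S = list(range(1, 2 * n + 1))          # S = {1, ..., 2n}; V' = S - V
    m = len(edges)
    rank = {frozenset(edge): i + 1 for i, edge in enumerate(edges)}   # phi

    def d(x, y):
        pair = frozenset((x, y))
        if pair in rank:                    # {x, y} in E
            return rank[pair]
        if (x in V) != (y in V):            # one endpoint in V, the other in V'
            return 2 * m
        if x not in V and y not in V:       # both endpoints in V'
            return 3 * m
        return 4 * m                        # both in V but {x, y} not in E

    u, v = e
    B = m     # bound B = |E|; the value k = n is not used by the phase-1 sketch
    return hcp(S, d, B, u, v)
```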

Page 30

Complexity of Hierarchical Clustering

• Theorem: The Hierarchical Clustering Problem is CC-complete.

• This expands the list of CC-complete problems and adds the first clustering/data mining problem to the class.


Page 32

CC-Complete Problems

Comparator Circuit Value Problem (CCVP)
• Given: An encoding of a Boolean circuit composed of comparator gates, inputs x1,…,xn, and a designated output y.
• Problem: Is output y TRUE on input x1,…,xn?
• References: [Cook 1982, Mayr and Subramanian 1992]

Page 33

CC-Complete Problems

Lexicographically First Maximal Matching Problem (LFMMP)
• Given: An undirected graph G = (V, E) with an ordering on its edges, plus a distinguished edge e ∈ E.
• Problem: Is e in the lexicographically first maximal matching of G?
• References: [Cook 1982, Mayr and Subramanian 1992]
• Remarks: Resembles the Lexicographically First Maximal Independent Set Problem, which is P-complete.

Page 34

CC-Complete Problems

Stable Marriage Problem (SMP)
• Given: A set of n men and a set of n women, and for each person a ranking of the members of the opposite sex according to their preference for a marriage partner.
• Problem: Does the given instance have a set of marriages that is stable? The set is stable if there is no unmatched pair {m, w} such that both m and w prefer each other to their current partners.

Page 35

CC-Complete Problems

Stable Marriage Problem (SMP)
• References: [Mayr and Subramanian 1992, Subramanian 1989]
• Remarks: If the preference lists are complete, there is always a solution. Several variations of the SMP are also known to be equivalent to the CCVP. The Male-Optimal Stable Marriage Problem asks for the stable marriage in which no man could do better in any stable marriage. (A sketch of the classical proposal algorithm for this variant follows.)
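For concreteness, here is a minimal sketch of the Gale-Shapley proposal algorithm, which computes the male-optimal stable matching mentioned in the remark; the data layout (preference lists as Python lists of 0-based indices) is our own assumption, not the paper's.

```python
def gale_shapley(men_prefs, women_prefs):
    """Male-optimal stable matching via the Gale-Shapley proposal algorithm.

    men_prefs[m]   = list of women, most preferred first.
    women_prefs[w] = list of men, most preferred first.
    Returns a dict mapping each woman to her husband.
    """
    n = len(men_prefs)
    # rank[w][m] = position of man m on woman w's list (lower is better).
    rank = [{m: r for r, m in enumerate(women_prefs[w])} for w in range(n)]
    next_choice = [0] * n          # index of the next woman each man proposes to
    husband = {}                   # woman -> man
    free_men = list(range(n))
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in husband:                        # w was free: engage
            husband[w] = m
        elif rank[w][m] < rank[w][husband[w]]:      # w prefers m: swap
            free_men.append(husband[w])
            husband[w] = m
        else:                                       # w rejects m
            free_men.append(m)
    return husband
```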

Page 36

CC-Complete Problems

Stable Marriage Stable Pair Problem (SMSPP)
• Given: A set of n men and n women, for each person a ranking of the members of the opposite sex according to their preference for a marriage partner, and a designated couple, Alice and Bob.
• Problem: Are Alice and Bob a stable pair for the given instance? That is, are Alice and Bob married to each other in some stable marriage?
• References: [Mayr and Subramanian 1992, Subramanian 1989]

Page 37

CC-Complete Problems

Stable Marriage Minimum Regret Problem (SMMRP)
• Given: A set of n men and n women, for each person a ranking of the members of the opposite sex according to their preference for a marriage partner, and a natural number k, 1 ≤ k ≤ n.
• Problem: Is there a stable marriage in which every person has regret at most k? The regret of a person in a stable marriage is the position of that person's mate on his or her preference list. (A small regret check is sketched below.)

Page 38

CC-Complete Problems

Stable Marriage Minimum Regret Problem (SMMRP)
• References: [Mayr and Subramanian 1992, Subramanian 1989]
• Remarks: The goal in this problem is to minimize the maximum regret of any person.

Page 39

CC-Complete Problems

Telephone Connection Problem (TCP)
• Given: A telephone line with a fixed channel capacity k, a natural number l, and a sequence of calls (s1, f1),…,(sn, fn), where si (fi) denotes the starting (respectively, finishing) time of the i-th call. The i-th call can be serviced at time si if the number of calls being served at that time is less than k. If the call cannot be served, it is discarded. When a call is completed, its channel is freed up.

Page 40

CC-Complete Problems

Telephone Connection Problem (TCP)
• Problem: Is the l-th call serviced?
• References: [Ramachandran and Wang 1991]
• Remarks: There is an O(min(√n, k) log n)-time EREW-PRAM algorithm that uses n processors for solving the TCP. (A sequential simulation is sketched below.)
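The problem itself is defined by a simple greedy simulation, so a sequential sketch is short; this is our own code (not the cited parallel algorithm), and it assumes the calls are given in order of non-decreasing start time and that a call finishing exactly at time s has already freed its channel.

```python
import heapq

def tcp(calls, k, l):
    """Is the l-th call serviced?

    calls: [(s1, f1), ..., (sn, fn)] in non-decreasing order of start time.
    k:     channel capacity; l: 1-based index of the call we ask about.
    A call is serviced at its start time iff fewer than k calls are in
    service at that moment; otherwise it is discarded.
    """
    in_service = []                  # min-heap of finish times of active calls
    serviced = [False] * len(calls)
    for i, (s, f) in enumerate(calls):
        while in_service and in_service[0] <= s:   # completed calls free channels
            heapq.heappop(in_service)
        if len(in_service) < k:
            heapq.heappush(in_service, f)
            serviced[i] = True
    return serviced[l - 1]
```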

Page 41

CC-Complete Problems

Internal Diffusion Limited Aggregation Prediction Problem (IDLAPP)
• Given: A time T and a list of moves (t, i, s), one for each time 0 ≤ t ≤ T, indicating that at time t particle i, if still active, will visit site s; plus a designated site d and a designated particle p. A particle is active if it is still moving within the cluster, that is, it has not yet stuck to the cluster because every site it has visited so far was already occupied.

Page 42

CC-Complete Problems

Internal Diffusion Limited Aggregation Prediction Problem (IDLAPP)
• Problem: Is site d occupied and is particle p active at time T? (A simulation sketch follows.)
• References: [Moore and Machta 2000]
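The dynamics described in the Given can be simulated directly; the sketch below is our own reading of those rules (not the authors' code), and the initially_occupied seed set is an assumed parameter.

```python
def idlapp(moves, T, d, p, initially_occupied=()):
    """Simulate the moves and decide: is site d occupied and particle p
    still active at time T?

    moves: list of (t, i, s) triples, one per time step 0 <= t <= T.
    A particle keeps moving while every site it visits is already occupied;
    on visiting an unoccupied site it sticks there (occupying the site)
    and becomes inactive.
    """
    occupied = set(initially_occupied)
    stuck = set()                        # particles that are no longer active
    for t, i, s in sorted(moves):        # process moves in time order
        if t > T or i in stuck:
            continue
        if s in occupied:
            continue                     # site already occupied: keep moving
        occupied.add(s)                  # first unoccupied site: particle sticks
        stuck.add(i)
    return d in occupied and p not in stuck
```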

Page 43

CC-Complete Problems

Internal Diffusion Limited Aggregation Prediction Square Lattice Problem
• Given: A time T and a list of moves (t, i, s) on a square lattice, one for each time 0 ≤ t ≤ T, indicating that at time t particle i, if still active, will visit site s; plus a designated site d and a designated particle p.
• Problem: Is site d on the square lattice occupied and is particle p active at time T?
• References: [Moore and Machta 2000]

Page 44

CC-Complete Problems

Hierarchical Clustering Problem (HCP)
• Given: A set S of n points in R^d, a distance function dS : S × S → ℕ, the number of clusters k ≤ n/2 with k ∈ ℕ, a distance bound B, and two points x, y ∈ S.
• Problem: Are x and y, with dS(x, y) ≤ B, in the same cluster C after the first phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm?
• Reference: [This work 2006]


Page 46

Conclusions
• A natural decision problem based on bottom-up hierarchical clustering (the HCP) is CC-complete.
• Top-down hierarchical clustering is in NC.
• This brings the number of known CC-complete problems to ten, and shows that the HCP is unlikely to have an NC algorithm.
• Fast parallel algorithms for hierarchical clustering should be based on a top-down approach.

Page 47

Open Problems
• Is the Euclidean HCP CC-complete? (It is in CC.)
• Determine the complexity of the second phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm.
• Add new problems to the list of CC-complete problems.


Page 49

References
• [Blumenthal 1953] Theory and Applications of Distance Geometry. Oxford University Press.
• [Boley 1998] Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4):325-344.
• [Chong, Han, and Lam 2001] Concurrent threads and optimal parallel minimum spanning tree algorithms. Journal of the ACM, 48(2):297-323.
• [Cole 1988] Parallel merge sort. SIAM Journal on Computing, 17(4):770-785.
• [Cook 1985] A taxonomy of problems with fast parallel algorithms. Information and Control, 64(1-3):2-22.
• [Dash, Petrutiu, and Scheuermann 2004] Efficient parallel hierarchical clustering. Lecture Notes in Computer Science, 3149:363-371.
• [Feder 1992] A new fixed point approach to stable networks and stable marriages. Journal of Computer and System Sciences, 45(2):233-284.
• [Gibbons 1985] Algorithmic Graph Theory. Cambridge University Press.
• [Greenlaw 1992] A model classifying algorithms as inherently sequential with applications to graph searching. Information and Computation, 97(2):133-149.

Page 50

References
• [Greenlaw, Hoover, and Ruzzo 1995] Limits to Parallel Computation: P-Completeness Theory. Oxford University Press.
• [Guha, Rastogi, and Shim 1998] CURE: An efficient clustering algorithm for large databases. In ACM SIGMOD, pages 378-385, Seattle, WA. Association for Computing Machinery.
• [Jain and Dubes 1988] Algorithms for Clustering Data. Prentice-Hall.
• [Karypis, Han, and Kumar 1999] CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68-75.
• [Kaufman and Rousseeuw 1990] Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons.
• [Li 1990] Parallel algorithms for hierarchical clustering and clustering validity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(11):1088-1092.
• [Li and Fang 1989] Parallel clustering algorithms. Parallel Computing, 11(3):275-290.
• [Mayr and Subramanian 1992] The complexity of circuit value and network stability. Journal of Computer and System Sciences, 44(2):302-323.

Page 51

References
• [Moore and Machta 2000] Internal diffusion-limited aggregation: Parallel algorithms and complexity. Journal of Statistical Physics, 99(3/4):661-690.
• [Olson 1995] Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313-1325.
• [Pólya, Tarjan, and Woods 1983] Notes on Introductory Combinatorics. Birkhäuser, Boston.
• [Rajasekaran 2005] Efficient parallel hierarchical clustering algorithms. IEEE Transactions on Parallel and Distributed Systems, 16(6):497-502.
• [Ramachandran and Wang 1991] Parallel algorithms and complexity results for telephone link simulation. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, pages 378-375, Dallas, TX, December. IEEE.
• [Reif (ed.) 1993] Synthesis of Parallel Algorithms. Morgan Kaufmann.
• [Sairam, Vitter, and Tamassia 1993] A complexity theoretic approach to incremental computation. In Finkel, Enjalbert, and Wagner, editors, STACS 93: 10th Annual Symposium on Theoretical Aspects of Computer Science, volume 665 of Lecture Notes in Computer Science, pages 640-649, Würzburg, Germany, February. Springer-Verlag.

Page 52

References
• [Steinbach, Karypis, and Kumar 2000] A comparison of document clustering techniques. In 6th ACM SIGKDD World Text Mining Conference, Boston, MA. Association for Computing Machinery.
• [Subramanian 1989] A new approach to stable matching problems. Technical Report STAN-CS-90-1275, Stanford University, Department of Computer Science.
• [Subramanian 1990] The Computational Complexity of the Circuit Value and Network Stability Problems. PhD thesis, Stanford University. Department of Computer Science Technical Report STAN-CS-90-1311.
• [Tsai, Horng, Lee, Tsai, and Kao 1997] Parallel hierarchical clustering algorithms on processor arrays with a reconfigurable bus system. Pattern Recognition, 30(5):801-815.
• [Wu, Horng, and Tsai 2000] Efficient parallel algorithms for hierarchical clustering on arrays with reconfigurable optical buses. Journal of Parallel and Distributed Computing, 60(9):1137-1153.


Page 54

Acknowledgements
• Computer Science Department at Chiang Mai University, Thailand
• Fulbright Commissions of Thailand and the United States
• Jim Hoover and Larry Ruzzo for material from [Greenlaw, Hoover, and Ruzzo 1995]