Datamining_3: Clustering Methods


TRANSCRIPT

  • Datamining_3: Clustering Methods. Clustering a set is partitioning that set. Partitioning is subdividing into subsets which are mutually exclusive (don't overlap) and collectively exhaustive (contain everything), such that each point is: similar to (close to) the other points in its own component, and dissimilar to (far from) the points in the other components.

    A categorization of the major clustering methods as they are known in the literature: Partitioning methods (k-means, k-medoids, ...); Hierarchical methods (Agglomerative (AGNES), Divisive (DIANA), ...); Density-based methods; Grid-based methods; Model-based methods.

  • The k-Means Clustering Method. Given k, the k-means algorithm is implemented in 4 steps (the assumed partitioning criterion is: maximize intra-cluster similarity and minimize inter-cluster similarity; of course, a heuristic is used, so the method isn't really an optimization):

    1. Partition into k subsets (or pick k initial means).

    2. Compute the mean (center) or centroid of each cluster of the current partition (if one started with k means initially, then this step is already done). A centroid is, roughly, a point that minimizes the sum of dissimilarities from the mean, or the sum of the square errors from the mean.

    3. Assign each object to the cluster with the most similar (closest) center.

    4. Go back to Step 2 (recompute the new centroids of the new clusters). Stop when the new set of means doesn't change much (or some other stopping condition holds). A sketch of these steps in code follows below.
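
    As a concrete illustration of these four steps, here is a minimal k-means sketch in Python/NumPy. It assumes numeric points and Euclidean distance; the function and variable names are ours, not from the slides.

      import numpy as np

      def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
          rng = np.random.default_rng(seed)
          # Step 1: pick k initial means (here, k random points from the set).
          centroids = points[rng.choice(len(points), size=k, replace=False)]
          for _ in range(max_iters):
              # Step 3: assign each point to the closest centroid.
              dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
              labels = dists.argmin(axis=1)
              # Step 2: recompute the centroid of each cluster of the current partition.
              new_centroids = np.array([
                  points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                  for j in range(k)])
              # Step 4: stop when the set of means doesn't change much.
              if np.linalg.norm(new_centroids - centroids) < tol:
                  break
              centroids = new_centroids
          return centroids, labels

      # e.g. kmeans(np.array([[1., 1.], [1.2, .8], [8., 8.], [8.1, 7.9]]), k=2)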

  • k-Means Clustering, animated (centroids are red, set points are blue). Step 1: assign each point to the closest centroid. Step 2: recalculate centroids. Step 3: re-assign each point to the closest centroid. Step 4: repeat Steps 2 and 3 until Stop_Condition = true. What are the strengths of k-means clustering? It is relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
  • The k-Medoids Clustering Method. Find representative objects, called medoids, which must be actual objects from the set (whereas the means are seldom points in the set itself).

    PAM (Partitioning Around Medoids, 1987): Choose an initial set of k medoids. Iteratively replace one of the medoids by a non-medoid; if the swap improves the aggregate similarity measure, retain the replacement. Do this over all medoid/non-medoid pairs (a sketch follows below). PAM works for small data sets, but it does not scale well to large data sets. Later modifications of PAM: CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990) sub-samples, then applies PAM; CLARANS (Clustering Large Applications based on RANdomized Search) (Ng & Han, 1994) randomized the sampling of CLARA.
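
    The sketch below, in Python/NumPy, illustrates the PAM swap idea under the assumption that a full pairwise distance matrix D fits in memory (which is exactly why PAM does not scale); names are illustrative, not the 1987 reference code.

      import numpy as np

      def total_cost(D, medoids):
          # Each object is charged the distance to its nearest medoid.
          return D[:, medoids].min(axis=1).sum()

      def pam(D, k, seed=0):
          rng = np.random.default_rng(seed)
          n = D.shape[0]
          medoids = list(rng.choice(n, size=k, replace=False))   # initial medoids
          best = total_cost(D, medoids)
          improved = True
          while improved:
              improved = False
              # Try every medoid / non-medoid swap; keep it if the cost improves.
              for i in range(k):
                  for h in range(n):
                      if h in medoids:
                          continue
                      candidate = medoids.copy()
                      candidate[i] = h
                      cost = total_cost(D, candidate)
                      if cost < best:
                          medoids, best, improved = candidate, cost, True
          labels = D[:, medoids].argmin(axis=1)   # assign each object to its nearest medoid
          return medoids, labels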

  • Hierarchical Clustering Methods: AGNES (Agglomerative Nesting). Introduced in Kaufmann and Rousseeuw (1990). Uses the single-link set distance (the distance between two sets is the minimum pairwise distance). Other options are complete link (the distance is the maximum pairwise distance), average link, and so on. Starting with each point as a cluster component of its own, iteratively merge the two clusters that are most similar, retaining each new clustering in a hierarchy. Eventually all nodes belong to the same cluster at the top, or root node, of this hierarchy or tree (a sketch follows below).
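
    For reference, single-link agglomerative clustering as described here can be sketched with SciPy's hierarchy module (assuming SciPy is available; the sample points are made up):

      import numpy as np
      from scipy.cluster.hierarchy import linkage

      pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.]])
      # 'single' = distance between two clusters is the minimum pairwise distance.
      Z = linkage(pts, method='single')
      print(Z)   # each row records one merge: (cluster i, cluster j, distance, new size)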

  • DIANA (Divisive Analysis). Introduced in Kaufmann and Rousseeuw (1990). The reverse of AGNES: initially all objects are in one cluster; then iteratively split cluster components into two components according to some criterion (e.g., again maximizing some aggregate measure of pairwise dissimilarity). Eventually each node forms a cluster on its own. A simplified sketch of one split follows below.
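
    A simplified sketch of one DIANA-style split (the classic "splinter group" heuristic on a distance matrix D); this is an assumption-laden illustration, not Kaufmann and Rousseeuw's exact procedure:

      import numpy as np

      def diana_split(D, members):
          members = list(members)
          # Seed the splinter group with the object farthest, on average, from the rest.
          avg = D[np.ix_(members, members)].mean(axis=1)
          splinter = [members.pop(int(avg.argmax()))]
          moved = True
          while moved and len(members) > 1:
              moved = False
              for obj in list(members):
                  d_old = np.mean([D[obj, o] for o in members if o != obj])
                  d_new = np.mean([D[obj, s] for s in splinter])
                  if d_new < d_old:          # closer to the splinter group: move it over
                      members.remove(obj)
                      splinter.append(obj)
                      moved = True
          return splinter, members           # the two new cluster components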

  • Contrasting DIANA and AGNES: DIANA works top down (divisively); AGNES works bottom up (agglomeratively).

  • Another look at Hierarchical Clustering

  • Another look at Hierarchical Clustering. In either case, one gets a nice dendrogram in which any maximal anti-chain (no 2 nodes are linked) is a clustering.

  • Hierarchical Clustering (Cont.) Any maximal anti-chain (a maximal set of nodes in which no 2 are directly connected) is a clustering (a dendrogram offers many).

  • Hierarchical Clustering (Cont.) But the horizontal anti-chains are the clusterings resulting from the top-down (or bottom-up) method(s).
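
    One such horizontal anti-chain can be extracted programmatically by cutting the dendrogram at a height, e.g. with SciPy's fcluster (illustrative data and cut level):

      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster

      pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.]])
      Z = linkage(pts, method='single')
      labels = fcluster(Z, t=2.0, criterion='distance')   # cut the tree at height 2.0
      print(labels)   # one cluster id per point: the clustering at that cut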

  • Data Mining Summary. Data mining on a given table of data includes:

    Association Rule Mining (ARM) on Bipartite Relationships

    Clustering: Partitioning methods (k-means, k-medoids, ...), Hierarchical methods (AGNES, DIANA, ...), Model-based methods, ...

    Classification (Decision Tree Induction, Bayesian, Neural Network, k-Nearest-Neighbor, ...). But most data mining is done on a database, not just one table; that is, often one must first apply the appropriate SQL query to a database to get the table to be data mined. The next slides discuss vertical data methods for doing that. You may wish to skip this material if you are not interested in the topic.

  • Review of P-trees. Predicate-tree: for a predicate on the leaf nodes of a partition-tree (it also induces predicates on interior nodes using quantifiers). Predicate-tree nodes can be truth values (Boolean P-tree), quantified existentially (1 or a threshold %) or universally; or Predicate-tree nodes can count the number of true leaf children of that component (Count P-tree).

    Purity-tree: a universally quantified Boolean Predicate-tree (e.g., if the predicate is "pure 1", the Pure1-tree or P1-tree). A 1-bit at a node iff the corresponding component is pure-1 (universally quantified). There are many other useful predicates, e.g., NonPure0-trees, but we will focus on P1-trees. All P-trees shown so far were 1-dimensional (recursively partition by halving bit files), but they can be 2-D (recursively quartering, e.g., used for 2-D images), 3-D (recursively eighth-ing), or based on purity runs or LZW runs. They are vertical, compressed, lossless structures that facilitate fast horizontal AND-processing.

    Formally, P-trees can be defined as any of the following. Partition-tree: a tree of nested partitions (a partition P(R) = {C1, ..., Cn}, each component of which is partitioned by P(Ci) = {Ci,1, ..., Ci,ni}, i = 1..n, each component of which is partitioned by P(Ci,j) = {Ci,j,1, ..., Ci,j,nij}, etc.). [Diagram: the partition tree, with root R, children C1, ..., Cn, grandchildren C1,1, ..., Cn,nn, and so on.] Further observations about P-trees: Partition-trees have set nodes. Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree). Purity-trees, being universally quantified Boolean Predicate-trees, have Boolean nodes (since the count is always the full count of leaves, expressing Purity-trees as Count-trees is redundant). A Partition-tree can be sliced at a level if each partition is labeled with the same label set (e.g., the Month partition of years). A Partition-tree can be generalized to a Set-graph when the siblings of a node do not form a partition. A toy P1-tree construction follows below.
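
    A toy illustration of a 1-dimensional Pure1-tree (our own encoding, not the authors' data structure layout): recursively halve a bit file and collapse any span that is pure (all 1s or all 0s):

      def p1_tree(bits):
          """Return 1 for a pure-1 span, 0 for a pure-0 span, else (left, right) subtrees."""
          if all(b == 1 for b in bits):
              return 1
          if all(b == 0 for b in bits):
              return 0
          mid = len(bits) // 2                      # recursively partition by halving
          return (p1_tree(bits[:mid]), p1_tree(bits[mid:]))

      print(p1_tree([1, 1, 1, 1, 0, 1, 1, 1]))      # -> (1, ((0, 1), 1))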

  • Vertical Select-Project-Join (SPJ) Queries. A Select-Project-Join query has joins, selections and projections. Typically there is a central fact relation to which several dimension relations are to be joined (a standard STAR data warehouse). E.g., the Student (S), Course (C), Enrol (E) STAR DB below (the bit encoding is shown next to the value for certain attributes).

    S | s     | name  | gen |
      | 0 000 | CLAY  | M 0 |
      | 1 001 | THAIS | M 0 |
      | 2 010 | GOOD  | F 1 |
      | 3 011 | BAID  | F 1 |
      | 4 100 | PERRY | M 0 |
      | 5 101 | JOAN  | F 1 |

    C | c     | name | st | term |
      | 0 000 | BI   | ND | F 0  |
      | 1 001 | DB   | ND | S 1  |
      | 2 010 | DM   | NJ | S 1  |
      | 3 011 | DS   | ND | F 0  |
      | 4 100 | SE   | NJ | S 1  |
      | 5 101 | AI   | ND | F 0  |

    E | s     | c     | grade |
      | 0 000 | 1 001 | B 10  |
      | 0 000 | 0 000 | A 11  |
      | 3 011 | 1 001 | A 11  |
      | 3 011 | 3 011 | D 00  |
      | 1 001 | 3 011 | D 00  |
      | 1 001 | 0 000 | B 10  |
      | 2 010 | 2 010 | B 10  |
      | 2 010 | 3 011 | A 11  |
      | 4 100 | 4 100 | B 10  |
      | 5 101 | 5 101 | B 10  |

    The vertical bit-sliced (uncompressed) attributes are stored as:
    S.s2: 000011   S.s1: 001100   S.s0: 010101   S.g: 001101
    C.c2: 000011   C.c1: 001100   C.c0: 010101   C.t: 011010
    E.s2: 0000000011   E.s1: 0011001100   E.s0: 0011110001
    E.c2: 0000000011   E.c1: 0001101100   E.c0: 1011100101
    E.g1: 1110011111   E.g0: 0110000100

    The vertical (un-bit-sliced) attributes are stored as:
    S.name: CLAY, THAIS, GOOD, BAID, PERRY, JOAN
    C.name: BI, DB, DM, DS, SE, AI
    C.st:   ND, ND, NJ, ND, NJ, ND
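
    The bit slices above are mechanical to produce; e.g. this small sketch (our own helper, illustrative only) reproduces the S.s slices from the S.s column:

      s_values = [0, 1, 2, 3, 4, 5]                          # S.s from the Student relation
      slices = {f"S.s{b}": [(v >> b) & 1 for v in s_values] for b in (2, 1, 0)}
      for name, bits in slices.items():
          print(name, "".join(map(str, bits)))
      # S.s2 000011
      # S.s1 001100
      # S.s0 010101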

  • Vertical preliminary Select-Project-Join Query Processing (SPJ). In the SCORE database (Students, Courses, Offerings, Rooms, Enrollments), numeric attributes are represented vertically as P-trees (not compressed); categorical attributes are projected to a one-column vertical file.

    SELECT S.n, C.n FROM S, C, O, R, E
    WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

    S: s n gen    |0 000|A|M| |1 001|T|M| |2 100|S|F| |3 111|B|F| |4 010|C|M| |5 011|J|F|
    C: c n cred   |0 00|B|1 01| |1 01|D|3 11| |2 10|M|3 11| |3 11|S|2 10|
    R: r cap      |0 00|30 11| |1 01|20 10| |2 10|30 11| |3 11|10 01|
    O: o c r      |0 000|0 00|0 01| |1 001|0 00|1 01| |2 010|1 01|0 00| |3 011|1 01|1 01| |4 100|2 10|0 00| |5 101|2 10|2 10| |6 110|2 10|3 11| |7 111|3 11|2 10|
    E: s o grade  |0 000|1 001|2 10| |0 000|0 000|3 11| |3 011|1 001|3 11| |3 011|3 011|0 00| |1 001|3 011|0 00| |1 001|0 000|2 10| |2 010|2 010|2 10| |2 010|7 111|3 11| |4 100|4 100|2 10| |5 101|5 101|2 10|

    The vertical files (bit slices and one-column categorical files):
    S.s2: 001100   S.s1: 000011   S.s0: 010101   S.n: A T S B C J   S.g: M M F F M F
    C.c1: 0011     C.c0: 0101     C.n: B D M S   C.r1: 0111   C.r0: 1110
    R.r1: 0011     R.r0: 0101     R.c1: 1110     R.c0: 1011
    O.o2: 00001111   O.o1: 00110011   O.o0: 01010101   O.c1: 00001111   O.c0: 00110001   O.r1: 00000111   O.r0: 11010010
    E.s2: 0000000011   E.s1: 0011001100   E.s0: 0011110001   E.o2: 0000000111   E.o1: 0001101100   E.o0: 1011100101   E.g1: 1110011111   E.g0: 0110000100

  • For the selections S.g=M, C.r=2 (10b), E.g=A (11b) and R.c=20 (10b), create the selection masks using ANDs and COMPLEMENTS (a sketch of this mask construction follows below):

    SELECT S.n, C.n FROM S, C, O, R, E
    WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

    SM:   110010        (S.g = M)
    Cr2:  0001          (C.r = 2 = 10b, so C.r1 AND C.r0')
    EgA:  0110000100    (E.g = A = 11b, so E.g1 AND E.g0)
    Rc20: 0100          (R.c = 20 = 10b, so R.c1 AND R.c0')

    Apply these selection masks (zero out the masked-off numeric values, blank out the others):

    S.s2: 000000   S.s1: 000010   S.s0: 010000   S.n: A T . . C .
    C.c1: 0001     C.c0: 0001     C.n: . . . S
    E.s2: 0000000000   E.s1: 0010000100   E.s0: 0010000000
    E.o2: 0000000100   E.o1: 0000000100   E.o0: 0010000100
    R.r1: 0000     R.r0: 0100
    O (no selection applies): O.o2: 00001111   O.o1: 00110011   O.o0: 01010101   O.c1: 00001111   O.c0: 00110001   O.r1: 00000111   O.r0: 11010010
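
    The masks above can be reproduced with bitwise AND and complement over the slices. A sketch with our own helper names; the C.r, E.g and R.c slice values are copied from the slide, and the categorical S.g column is assumed to be encoded as a bit with M = 0, as in the earlier Student table:

      import numpy as np

      def bits(s):                                    # "110010" -> array([1,1,0,0,1,0])
          return np.array([int(c) for c in s], dtype=np.uint8)

      S_g        = bits("001101")                     # S.g, assuming M is encoded as 0
      C_r1, C_r0 = bits("0111"), bits("1110")         # C.r (credit) bit slices
      E_g1, E_g0 = bits("1110011111"), bits("0110000100")
      R_c1, R_c0 = bits("1110"), bits("1011")         # R.c (capacity) bit slices

      SM   = 1 - S_g                                  # S.g = M   (complement)
      Cr2  = C_r1 & (1 - C_r0)                        # C.r = 2   = 10b
      EgA  = E_g1 & E_g0                              # E.g = A   = 11b
      Rc20 = R_c1 & (1 - R_c0)                        # R.c = 20  = 10b

      for name, m in [("SM", SM), ("Cr2", Cr2), ("EgA", EgA), ("Rc20", Rc20)]:
          print(name, "".join(map(str, m)))           # 110010, 0001, 0110000100, 0100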

  • For the joins S.s=E.s, C.c=O.c, O.o=E.o, O.r=R.r, one approach is to follow an indexed-nested-loop-like method (noting that attribute P-trees ARE an index for that attribute). A generic sketch of the lookup step follows below.

    SELECT S.n, C.n FROM S, C, O, R, E
    WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

    The masked vertical files carried forward from the selections:
    S.s2: 000000   S.s1: 000010   S.s0: 010000   S.n: A T . . C .
    C.c1: 0001     C.c0: 0001     C.n: . . . S
    E.s2: 0000000000   E.s1: 0010000100   E.s0: 0010000000
    E.o2: 0000000100   E.o1: 0000000100   E.o0: 0010000100
    O.o2: 00001111   O.o1: 00110011   O.o0: 01010101   O.c1: 00001111   O.c0: 00110001   O.r1: 00000111   O.r0: 11010010
    R.r1: 0000     R.r0: 0100

    The join O.r=R.r is simply part of a selection on O (R contributes no output and participates in no further operations). Use the Rc20-masked R (Rc20: 0100) as the outer relation, and O as the indexed inner relation, to produce that O-selection mask. Get the 1st R.r value, 01b (there is only one). Mask the O tuples by ANDing the O.r bit slices (complemented as needed): 00000111 ^ 00101101 gives OM: 00000101. This is the only R.r value (if there were more, one would do the same for each, then OR those masks to get the final O-mask). Next, apply the O-mask OM to O:
    O.o2: 00000101   O.o1: 00000001   O.o0: 00000101   O.c1: 00000101   O.c0: 00000001
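
    In general, this "index lookup" is: for each surviving value of the outer attribute, AND the inner relation's bit slices (complementing where the value's bit is 0), then OR the per-value masks. A generic sketch with our own helper; the data below is illustrative, not the slide's exact numbers:

      import numpy as np

      def equality_mask(slices, value):
          """slices: bit arrays, high bit first; returns the mask of tuples equal to value."""
          nbits = len(slices)
          mask = np.ones_like(slices[0])
          for i, sl in enumerate(slices):
              bit = (value >> (nbits - 1 - i)) & 1
              mask &= sl if bit else (1 - sl)
          return mask

      # Hypothetical 2-bit O.r slices over 8 O-tuples, and the surviving R.r values.
      O_r1 = np.array([0, 0, 0, 0, 0, 1, 1, 1], dtype=np.uint8)
      O_r0 = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=np.uint8)
      surviving_r = [1]                                # e.g. a single masked R.r value

      OM = np.zeros_like(O_r1)
      for r in surviving_r:
          OM |= equality_mask([O_r1, O_r0], r)         # OR the per-value masks
      print("".join(map(str, OM)))                     # the O-mask for the O.r = R.r join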

  • For the final 3 joins, C.c=O.c, O.o=E.o, E.s=S.s, the same indexed-nested-loop-like method can be used.

    SELECT S.n, C.n FROM S, C, O, R, E
    WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

    Working with the masked S, C and E files above and the OM-masked O files
    (O.o2: 00000101, O.o1: 00000001, O.o0: 00000101, O.c1: 00000101, O.c0: 00000001):

    Get the 1st masked C.c value, 11b. Mask the corresponding O tuples: P_O.c1 ^ P_O.c0 = 00000101 ^ 00000001 gives OM: 00000001.
    Get the 1st masked O.o value, 111b. Mask the corresponding E tuples: P_E.o2 ^ P_E.o1 ^ P_E.o0 gives EM: 0000000100.
    Get the 1st masked E.s value, 010b. Mask the corresponding S tuples: P_S.s2' ^ P_S.s1 ^ P_S.s0' gives SM: 000010.
    Get the S.n value(s), C; pair with the C.n value(s), S; and output the concatenation C.n S.n: "S C".
    There was just one masked tuple at each stage in this example. In general, one would loop through the masked portion of the extant domain at each level (hence, Indexed Horizontal Nested Loop, or IHNL).

  • Vertical Select-Project-Join-Classification Query. Given the previous SCORE training database (not presented as just one training table),

    predict what course a male student will register for, given that he got an A in a previous course held in a room with a capacity of 20.

    This is a matter of first applying the previous complex SPJ query to get the pertinent training table, and then classifying the above unclassified sample

    (e.g., using 1-nearest-neighbour classification).

    The result of the SPJ is the single-row training set (S, C), and so the prediction is course = C.
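
    A minimal 1-nearest-neighbour step over the resulting training table (the feature encoding and helper names are ours; with the single training row (S, C) from the slide, any sample is predicted course C):

      def one_nn(training, sample, dist):
          # training: list of (feature_vector, class_label) rows; return the nearest row's label.
          return min(training, key=lambda row: dist(row[0], sample))[1]

      training = [((0, 1), "C")]              # the one SPJ result row, e.g. (gen=M->0, grade=A->1) labelled course C
      sample   = (0, 1)                       # male student with an A, room capacity 20
      dist     = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
      print(one_nn(training, sample, dist))   # -> C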

  • Thank you.

    (Agglomerative merging is based on similarity: how does one decide which pair of clusters to combine?)