
Page 1:

Cluster Description and Related Problems

DDM Lab Seminar

2005 Spring

Byron Gao

Page 2:

Road Map

1. Motivations

2. Cluster Description Problems

3. Related Problems in Machine Learning

4. Related Problems in Theory

5. Related Problems in Computational Geometry

6. Conclusions and Past / Future Work

Page 3:

Clustering

Group objects into clusters such that objects within a cluster are similar and objects from different clusters are dissimilar

– one of the major data mining tasks
– K-means, hierarchical, density-based, …

1. Motivations

Page 4:

Cluster Descriptions

The literature lacks a systematic study of cluster descriptions
– Most clustering algorithms only give membership assignments
– Hidden knowledge in the data needs to be represented for inference
– Integration of databases and data mining

Purposes of cluster descriptions
– Summarize data to gain initial knowledge
– Compress data to support further investigation

One way to describe a cluster: a DNF formula
– a set of isothetic (axis-parallel) hyper-rectangles

– Interpretable
– Can be used as a search condition to retrieve objects (a minimal sketch follows below)

1. Motivations
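As a minimal illustration of the DNF-as-rectangles idea sketched above (hypothetical code, not from the original slides; the representation and the toy 2-d data are invented), a description can be stored as a union of axis-parallel boxes and evaluated per object:

```python
# Hypothetical sketch: a DNF cluster description as a union of axis-parallel boxes.
# Each box is a list of (low, high) bounds, one pair per dimension.

Box = list[tuple[float, float]]          # e.g. [(x_lo, x_hi), (y_lo, y_hi)]
Description = list[Box]                  # DNF: a union (OR) of boxes

def in_box(point: tuple[float, ...], box: Box) -> bool:
    """A point satisfies a box (a conjunction) if it lies within every interval."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, box))

def satisfies(point: tuple[float, ...], desc: Description) -> bool:
    """A point satisfies the DNF description if it lies in at least one box."""
    return any(in_box(point, box) for box in desc)

# Toy 2-d example (made up): two rectangles describing one cluster.
desc = [[(0.0, 2.0), (0.0, 1.0)], [(1.5, 3.0), (0.5, 2.0)]]
print(satisfies((1.0, 0.5), desc))   # True
print(satisfies((4.0, 4.0), desc))   # False
```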

Page 5:

Summarize Data

Gain an initial and basic understanding of the clusters
– Capture shape and location in the multi-dimensional space

Requirement: interpretability
– Simple and clear structure of the description format
  e.g., DNF, B-DNF, … (B-DNF: the set difference between the bounding box B and a DNF formula describing the objects in B but not in cluster C)
– Short in length

Another requirement: accuracy
– Interpreting the correct target
– May trade accuracy for shorter length

1. Motivations

Page 6:

Compress Data (1)

Data compression has important applications in data transmission and data storage

– Transmission: distribute clustering results to branches
  clustering is performed interactively by an expert at the head office
– Storage: support further investigation
  partial results or final results (in some sense, results are always partial)

Description process ≈ encoding (compression)

Retrieval process ≈ decoding (decompression)

– The description serves as the search condition in a SELECT query statement (a sketch follows below)

1. Motivations
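As a rough illustration of retrieval as decoding (again a hypothetical sketch; the table and column names are invented and no particular DBMS is assumed), a DNF description can be turned into the WHERE clause of a SELECT statement:

```python
# Hypothetical sketch: translate a DNF description (a union of axis-parallel boxes)
# into the WHERE clause of a SQL SELECT statement. Table and column names are made up.

def box_to_predicate(box, columns):
    """One box becomes a conjunction of range conditions, one per dimension."""
    parts = [f"{col} BETWEEN {lo} AND {hi}" for col, (lo, hi) in zip(columns, box)]
    return "(" + " AND ".join(parts) + ")"

def description_to_sql(desc, table, columns):
    """The whole description is the disjunction (OR) of its boxes."""
    where = " OR ".join(box_to_predicate(box, columns) for box in desc)
    return f"SELECT * FROM {table} WHERE {where};"

desc = [[(0.0, 2.0), (0.0, 1.0)], [(1.5, 3.0), (0.5, 2.0)]]
print(description_to_sql(desc, "points", ["x", "y"]))
# -> SELECT * FROM points WHERE (x BETWEEN 0.0 AND 2.0 AND y BETWEEN 0.0 AND 1.0) OR (x BETWEEN 1.5 AND 3.0 AND y BETWEEN 0.5 AND 2.0);
```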

Page 7:

Compress Data (2)

Requirement: compression ratio and retrieval efficiency
– Short in length
  comp. ratio = (|DESC| × 2) / |C| (description size over cluster size), typically < 0.01

Another requirement: accuracy
– Retrieve the desired objects
– May trade accuracy for shorter length: lossy compression / faster retrieval

1. Motivations

Page 8:

Compress Data (3)

Support further investigation
– Exploratory queries without retrieval of original objects
  e.g., incremental clustering
– Exploratory queries requiring retrieval of original objects
  information not associated with the description
  information not revealed by the description
  interactive and iterative mining environment

– Inductive database framework

1. Motivations

Page 9:

SDL Problem

Shortest Description Length (SDL) problem: Given a cluster C, its bounding box B, and a description format, obtain a logical expression DESC in the given format with minimum length such that for any object o, (o ∈ C ⇒ o satisfies DESC) and (o ∈ B − C ⇒ o does not satisfy DESC)

Lossless

NP-hard
– a variant of the Minimum Set Cover problem

2. Cluster Description Problems
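To make the lossless condition in the SDL definition concrete, here is a hedged sketch (hypothetical helper names; it assumes the rectangle representation used in the earlier sketches):

```python
# Hypothetical sketch: check that a DNF description (a list of axis-parallel boxes)
# is lossless for cluster C within bounding box B, per the SDL definition:
# every object of C satisfies DESC, and no object of B - C does.

def in_box(point, box):
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, box))

def satisfies(point, desc):
    return any(in_box(point, box) for box in desc)

def is_lossless(desc, cluster, box_objects):
    """cluster: the objects in C; box_objects: all objects in the bounding box B."""
    cluster_set = set(cluster)
    covers_all = all(satisfies(o, desc) for o in cluster)
    excludes_rest = all(not satisfies(o, desc)
                        for o in box_objects if o not in cluster_set)
    return covers_all and excludes_rest

# The SDL problem asks for the shortest such DESC (fewest boxes); finding it is NP-hard.
```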

Page 10:

MDQ Problem

Maximum Description Quality (MDQ) problem: Given a cluster C, its bounding box B, an integer l, a description format, and a quality measure, obtain a logical expression DESC in the given format with length at most l such that the quality measure is maximized

Lossy

NP-hard:

– variants of the Set Cover problem reduce to it

2. Cluster Description Problems

Page 11:

MDQ Problem: Quality Measure

F-measure: F = 2PR / (P + R)
– Harmonic mean of P and R
– R = |C_described| / |C|
– P = |C_described| / |B_described|, where C_described (B_described) is the set of objects in C (B) that are described by the description

RP=1 (Recall at fixed Precision of 1)

PR=1 (Precision at fixed Recall of 1)

2. Cluster Description Problems
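A small worked sketch of these quality measures (hypothetical code; it reuses the rectangle-description helpers from the earlier sketches):

```python
# Hypothetical sketch: recall, precision, and F-measure of a description,
# following the definitions above (C_described over C, and over B_described).

def in_box(point, box):
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, box))

def satisfies(point, desc):
    return any(in_box(point, box) for box in desc)

def description_quality(desc, cluster, box_objects):
    c_described = [o for o in cluster if satisfies(o, desc)]
    b_described = [o for o in box_objects if satisfies(o, desc)]
    recall = len(c_described) / len(cluster)
    precision = len(c_described) / len(b_described) if b_described else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# RP=1 fixes precision at 1 (describe only cluster objects) and maximizes recall;
# PR=1 fixes recall at 1 (describe every cluster object) and maximizes precision.
```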

Page 12:

Concept Learning

Concept learning: find a classifier for a concept
– Given a labeled training sequence, generate a hypothesis that is a good approximation of the target concept and can be used as a predictor for other query points

Typical machine learning problem
– Most extensively studied problem in computational learning theory

Representation of target concept assumed known
– DNF formulas, Boolean circuits, geometric objects, pattern languages, deterministic finite automata, etc.

3. Related Problems in Machine Learning

Page 13:

Geometric Concept Learning

Target is a geometric object
– Half spaces, separated by linear hyperplanes
  perceptron learning (a sketch follows below)
– Intersections of half spaces: convex polyhedra, particularly axis-parallel boxes
– More accurate, yet simple enough to offer intuitive solutions
– Unions of axis-parallel boxes, e.g., 2 boxes; less studied

Popular topic in computational learning theory

3. Related Problems in Machine Learning
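Since the slide mentions perceptron learning for half spaces, here is a minimal, hedged sketch of the classic perceptron update rule (not from the original slides; the data and names are invented):

```python
# Hypothetical sketch: the classic perceptron update for learning a half space
# (a linear separator) from labeled examples. Labels are +1 / -1.

def perceptron(examples, labels, epochs=100):
    d = len(examples[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        updated = False
        for x, y in zip(examples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:            # misclassified: move the hyperplane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                updated = True
        if not updated:                        # converged on linearly separable data
            break
    return w, b

# Toy 2-d data (made up): points above the line x + y = 1 are labeled positive.
X = [(0.0, 0.0), (2.0, 2.0), (0.2, 0.3), (1.5, 1.0)]
y = [-1, +1, -1, +1]
print(perceptron(X, y))
```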

Page 14:

Learning Models

PAC learning model: Probably Approximately Correct [15]

– Given a reasonable number of labeled instances, generate with high probability a hypothesis that is a good approximation of the target concept within a reasonable amount of time

Exact Learning Model [2]

– The learner is allowed to pose queries in order to exactly identify the target concept within a reasonable amount of time

– Common queries, answered by an oracle
  Membership query: feed an instance to the oracle; it returns the instance's classification
  Equivalence query: present a hypothesis to the oracle; it returns “correct” or a counterexample

3. Related Problems in Machine Learning
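To make the PAC model concrete for the box concepts discussed here, the standard textbook learner for a single axis-parallel box simply outputs the tightest box around the positive examples (a hedged sketch with invented names, not taken from the slides):

```python
# Hypothetical sketch: the standard PAC learner for a single axis-parallel box.
# It outputs the tightest (minimal bounding) box around the positive examples;
# with enough random samples this hypothesis is "probably approximately correct".

def tightest_box(positive_points):
    """Per-dimension (min, max) bounds of the positive examples."""
    dims = zip(*positive_points)              # transpose: one tuple per dimension
    return [(min(vals), max(vals)) for vals in dims]

def predict(point, box):
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, box))

# Toy example (made up): positives drawn from the target box [1,3] x [2,4].
pos = [(1.2, 2.5), (2.9, 3.8), (1.8, 2.1)]
h = tightest_box(pos)                         # hypothesis box, contained in the target
print(h, predict((2.0, 3.0), h), predict((5.0, 5.0), h))
```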

Page 15:

Similarities

Learn unions of axis-parallel boxes

– DNF formula

Learner is fed with positive and negative examples which need to be classified accurately

3. Related Problems in Machine Learning

Page 16:

Dissimilarities (1): objective

Geometric concept learning
– Exact or approximate identification of a geometric object
  related to geometric object construction in computational geometry
– Ultimate goal is classification accuracy over the instance space
  labeled instances are only a small part of the instance space
– Optimization: the least disagreement problem
  fixed number of boxes, usually 1 or 2; “+” and “–” examples are treated equally

Cluster description problems
– The entire instance space is given
– Describe the “+” examples rather than classify “+” and “–” examples
– Optimization problem: SDL or MDQ
  for the quality measure, “+” and “–” examples are not treated equally; examples could be labeled “1”, “5”, “9”, … for clustering description

3. Related Problems in Machine Learning

Page 17:

Dissimilarities (2): research focus

Geometric concept learning
– Learnability
  under certain learning models
– Learning from queries (membership, equivalence) rather than from random examples to significantly improve performance
– Low-dimensional

Cluster description problems
– Heuristics
– Large data sets, possibly high-dimensional
– Other concerns: compression tool, retrieval efficiency, etc.

3. Related Problems in Machine Learning

Page 18:

Two Traditional Problems

Minimum Set Cover problem [10]

– Given a collection S of subsets of a finite ground set X, find S′ ⊆ S with minimum cardinality such that ⋃_{Si ∈ S′} Si = X
– Approximable within ln |X| [8]: greedy set cover
  iteratively picks the subset that covers the maximum number of uncovered elements

Maximum Coverage problem (max k-cover): a variant problem
– Maximize the number of covered elements in X using at most k subsets from S
– Approximable within e / (e − 1) by the greedy algorithm [8]

The ratios are optimal [6]

4. Related Problems in Theory
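A minimal sketch of the greedy set cover heuristic referenced above (hypothetical code, not from the slides):

```python
# Hypothetical sketch: greedy set cover. At each step, pick the subset covering the
# most still-uncovered elements; this achieves the ln|X| approximation ratio.

def greedy_set_cover(X, subsets):
    """X: set of ground elements; subsets: list of sets whose union should be X."""
    uncovered = set(X)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(s & uncovered))
        if not best & uncovered:              # X cannot be fully covered
            break
        chosen.append(best)
        uncovered -= best
    return chosen

X = {1, 2, 3, 4, 5, 6}
S = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 5}]
print(greedy_set_cover(X, S))                 # [{1, 2, 3}, {4, 5, 6}]
```

Stopping after at most k chosen subsets instead of covering everything gives the greedy algorithm for Maximum Coverage.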

Page 19:

Red Blue Set Cover Problem

Given a finite set of “red” elements R, a finite set of “blue” elements B, and a family S ⊆ 2^(R ∪ B), find a subfamily S′ ⊆ S which covers all blue elements and a minimum number of red elements

Raised recently (2000) [4]

Related to the Group Steiner tree, minimum monotone satisfying assignment, and minimum color path problems

At least as hard as the Set Cover problem (Set Cover reduces to it) [4]

4. Related Problems in Theory

Page 20:

SDL Problem is NP-hard

A variant of the Minimum Set Cover problem
– Ground set X = set of objects in C, the cluster being described
– Collection S of rectangles (subsets):
  each Sj ∈ S is a rectangle minimally covering some objects in C and no objects not in C
  S ⊆ 2^C
– Find S′ ⊆ S with minimum cardinality such that ⋃_{Si ∈ S′} Si = C

4. Related Problems in Theory
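To make the set cover view concrete, a hypothetical sketch of how a candidate rectangle induces a subset of C (helper names are invented; the candidate rectangles themselves are assumed to be given, since enumerating them is the hard part):

```python
# Hypothetical sketch: the set cover view of SDL. A candidate rectangle that covers
# only cluster objects induces a subset of C; SDL asks for the fewest such subsets
# whose union is C. Because the subsets are not given explicitly, the plain greedy
# set cover algorithm cannot be applied directly.

def in_box(point, box):
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, box))

def induced_subset(box, cluster, box_objects):
    """Objects of C covered by the rectangle, or None if it covers a non-C object."""
    covered = {o for o in box_objects if in_box(o, box)}
    return covered if covered <= set(cluster) else None
```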

Page 21:

MDQ Problem is NP-hard

RP=1: a variant of the Maximum Coverage problem

– In the same fashion as shown on the previous slide
– Given k, find S′ ⊆ S with |S′| at most k such that the number of covered elements in C is maximized

PR=1: a variant of the Red Blue Set Cover problem

F-measure: reducible to either of the two

4. Related Problems in Theory

Page 22:

Similarities and Dissimilarities

Set Cover problem and its variants arise in a variety of settings

Useful to argue NP-hardness of SDL and MDQ problems

Not useful from an algorithm design point of view because of an essential difference: in the SDL and MDQ problems, the collection of subsets S is not explicitly given

– e.g., the greedy set cover algorithm cannot be applied

4. Related Problems in Theory

Page 23:

Maximum Box Problem

Given two finite sets of points X⁺ and X⁻ in ℝ^d, find an axis-parallel box B such that B ∩ X⁻ = ∅ and the total number of points from X⁺ covered is maximized

Raised recently (2002) [5]
NP-hard when d is part of the input
Studied by [12] [14] for d = 2, finding exact solutions

Corresponds to MDQ problem with k = 1 and RP=1

Techniques for low-dimensional and small datasets

5. Related Problems in Computational Geometry
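For intuition, a hedged brute-force sketch of the planar (d = 2) case: candidate boxes can be restricted to those whose sides pass through coordinates of positive points. This naive enumeration is far slower than the specialized algorithms of [12] [14]; the names and data are invented:

```python
# Hypothetical sketch: brute-force planar maximum box. Try every axis-parallel box
# whose x- and y-extents are defined by coordinates of positive points, discard boxes
# containing a negative point, and keep one covering the most positives.
from itertools import combinations_with_replacement

def max_box_2d(pos, neg):
    xs = sorted({p[0] for p in pos})
    ys = sorted({p[1] for p in pos})
    best_box, best_count = None, -1
    for x1, x2 in combinations_with_replacement(xs, 2):
        for y1, y2 in combinations_with_replacement(ys, 2):
            def inside(p):
                return x1 <= p[0] <= x2 and y1 <= p[1] <= y2
            if any(inside(q) for q in neg):
                continue
            count = sum(inside(p) for p in pos)
            if count > best_count:
                best_box, best_count = ((x1, x2), (y1, y2)), count
    return best_box, best_count

pos = [(1, 1), (2, 2), (3, 1), (4, 4)]
neg = [(2.5, 1.5)]
print(max_box_2d(pos, neg))                   # a box covering 2 positives and no negative
```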

Page 24:

Rectilinear Polygon Covering Problem

Find a collection of axis-parallel rectangles with minimum cardinality whose union is exactly a given rectilinear polygon

– Covering a rectilinear polygon (possibly with holes) with axis-parallel rectangles [7] (1979)
– Different from partition: overlapping is allowed
– 2-d

NP-complete [13]
No polynomial-time approximation scheme is possible [3]
A special case of the general set covering problem
– performance guarantee O(log n)
– recent result: O(√log n) [9]

5. Related Problems in Computational Geometry

Page 25:

Similarities and Dissimilarities

If extended to higher-dimensional space, exactly the same formulation as in Greedy Growth [1]

If “don’t care” cells are further allowed, a formulation similar to that of Algorithm BP [11]

If not restricted to grid data, the same as the SDL problem

[1] and [11] are related studies in the DB literature
Logic minimization is also related to the grid case
– Most research is on Boolean variables

5. Related Problems in Computational Geometry

Page 26:

Conclusions

Cluster (clustering) description is of fundamental importance to KDD, but has not been systematically studied

The formulated problems differ significantly from related problems in the machine learning, theory, computational geometry, and DB literatures.

6. Conclusions and Past / Future Work

Page 27:

Past Work

Formulated and provided heuristic algorithms for the SDL and MDQ problems
– for both DNF and B-DNF formats
– work for both vector and grid data

Studied the efficient retrieval issue
– cost model

– efficiency estimation methodology

– optimal ordering of description terms

6. Conclusions and Past / Future Work

Page 28:

Future Work

Optimization of the algorithms: the focus so far has been on introducing concepts and ideas

Real data experiments

Experimental validation of the retrieval efficiency issue: DBMS

Description alphabet: always rectangles?

– if not, getting farther away from some related problems

Clustering description: gaining quality and efficiency

– even more distinguishable from related problems

Incremental description: the description process could take a long time

Interaction with clustering algorithms: online description?

6. Conclusions and Past / Future Work

Page 29:

References:

1. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD 1998.

2. D. Angluin. Queries and concept learning. Machine Learning, 2(4): 319-342, 1988.

3. P. Berman and B. Dasgupta. Approximating rectilinear polygon cover problems. Algorithmica, 17: 331-356, 1997.

4. R.D. Carr, S. Doddi, G. Konjevod and M. Marathe. On the red-blue set cover problem. SODA 2000.

5. J. Eckstein, P. Hammer, Y. Liu, M. Nediak, and B. Simeone. The maximum box problem and its application to data analysis. Computational Optimization and Applications, 23(3): 285-298, 2002.

6. U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4): 634-652, 1998.

7. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York, 1979.

8. D.S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. Approximation algorithms for NP-hard problems. PWS, New York, 1997.

9. V.S.A. Kumar and H. Ramesh. Covering rectilinear polygons with axis-parallel rectangles. SIAM J. Comput. 32(6): 1509-1541, 2003.

10. D.S. Johnson. Approximation algorithms for combinatorial problems. J. Comput. System Sci., 9: 256-278, 1974.

11. L.V.S. Lakshmanan, R.T. Ng, C.X. Wang, X. Zhou, and T.J. Johnson. The generalized MDL approach for summarization. VLDB 2002.

12. Y. Liu and M. Nediak. Planar case of the maximum box and related problems. CCCG 2003.

13. W.J. Masek. Some NP-Complete Set Covering Problems, manuscript, MIT, Cambridge, MA, 1979.

14. M. Segal. Planar maximum box problem. Journal of Mathematical Modelling and Algorithms, 3: 31-38, 2004.

15. L.G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, 1984.

Page 30:

Questions?