Clustering for web documents

박흠

Contents
- Cluto
- Criterion Functions for Document Clustering: Experiments and Analysis (2002), by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
- Feature selection for web documents (2004)

Cluto
- Clustering Toolkit, version 2.1.1
- Department of Computer Science, University of Minnesota, Minneapolis
- http://www-users.cs.umn.edu/~karypis/
- Platforms: Linux 2.4.18, Sun OS 5.7, Win32
- Programs: vcluster, scluster
- CLUTO's user-callable library

Cluto: What is Cluto? (1/2)
- Clustering algorithms: partitional clustering, agglomerative clustering, graph-partitioning clustering
- Clustering criterion functions: provides seven different criterion functions that can be used with both partitional and agglomerative clustering algorithms
- Also provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) for agglomerative clustering

Cluto: What is Cluto? (2/2)
- Analyzes the discovered clusters:
- relations between the objects assigned to each cluster
- relations between the different clusters
- identifies the features that best describe and/or discriminate each cluster
- relationships between the clusters, objects, and features
- Operates on very large datasets, in terms of both the number of objects and the number of dimensions

Cluto: Programs
- vcluster: operates in the object's feature space
- scluster: operates in the object's similarity space
- Interface: vcluster [optional parameters] MatrixFile Ncluster
- MatrixFile: an n*m matrix; rows correspond to objects, columns to features
- Ncluster: number of clusters
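To make the n*m layout concrete, here is a minimal sketch that turns a toy corpus into a matrix with one row per object (document) and one column per feature (term). The corpus and the helper name `build_matrix` are illustrative assumptions; writing the matrix out in CLUTO's own file format is not shown, because the slides do not describe that format.

```python
from collections import Counter

def build_matrix(docs):
    """Return (vocab, matrix) with one row per document and one column per term."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # Counter yields 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Toy corpus (illustrative only).
docs = ["web document clustering", "document clustering toolkit", "web web search"]
vocab, matrix = build_matrix(docs)
n, m = len(matrix), len(vocab)  # n objects (rows), m features (columns)
print(n, m)
for row in matrix:
    print(row)
```

Each row is what vcluster would treat as one object in feature space.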

Cluto: Parameters of algorithms
- rb, rbr: k-1 repeated bisections (rbr additionally optimizes the criterion function)
- direct: computed by simultaneously finding all k clusters
- agglo: the agglomerative paradigm
- graph: using a nearest-neighbor graph
- bagglo
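The idea behind k-1 repeated bisections can be sketched as follows. This is only an illustration under stated assumptions, not CLUTO's implementation: each bisection here is a tiny spherical 2-means, the largest cluster is always the one split next, and the extra optimization pass of rbr is not implemented. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else v

def composite(vectors, members):
    """Sum of the vectors whose indices are listed in `members`."""
    return [sum(col) for col in zip(*(vectors[i] for i in members))]

def bisect(vectors, members, iterations=10):
    """Split `members` (at least two indices) into two groups with spherical 2-means."""
    first = members[0]
    # Seed the second group with the member least similar to the first seed.
    second = min(members, key=lambda i: dot(vectors[first], vectors[i]))
    centers = [vectors[first], vectors[second]]
    groups = None
    for _ in range(iterations):
        new_groups = [[], []]
        for i in members:
            side = 0 if dot(vectors[i], centers[0]) >= dot(vectors[i], centers[1]) else 1
            new_groups[side].append(i)
        for side in (0, 1):
            if not new_groups[side]:  # degenerate case: force a non-empty split
                new_groups[side].append(new_groups[1 - side].pop())
        if new_groups == groups:      # assignments stopped changing
            break
        groups = new_groups
        centers = [normalize(composite(vectors, g)) for g in groups]
    return groups

def repeated_bisections(vectors, k):
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of documents")
    vectors = [normalize(v) for v in vectors]
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:          # k - 1 bisections in total
        largest = max(clusters, key=len)
        clusters.remove(largest)
        clusters.extend(bisect(vectors, largest))
    return clusters

# Toy data: three obvious pairs of similar documents.
docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [0, 0, 1], [0.1, 0, 0.9]]
clusters = repeated_bisections(docs, 3)
print(clusters)
```

The direct method, by contrast, would search for all k clusters at once instead of splitting one cluster at a time.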

Cluto: Parameters of the similarity function
- cos: the cosine function (default)
- corr: the correlation coefficient
- dist: the Euclidean distance; applicable when -clmethod=graph
- jacc: the extended Jaccard coefficient; applicable when -clmethod=graph
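For reference, the four measures named above can be sketched with their usual textbook definitions. This assumes non-zero (and, for the correlation, non-constant) vectors; the function names are illustrative, and CLUTO's own implementation may differ in details.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def correlation(a, b):
    """Pearson correlation: the cosine of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

def euclidean(a, b):
    """Euclidean distance (a dissimilarity: 0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)

a, b = [1, 2, 0], [2, 4, 0]  # b points in the same direction as a
print(cosine(a, b), correlation(a, b), euclidean(a, b), extended_jaccard(a, b))
```

Note that cosine and correlation ignore vector length, while the Euclidean distance and the extended Jaccard coefficient do not.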

Cluto: Parameters of the criterion function
- i1, i2, e1, g1, g1p, h1, h2

Cluto: Parameters of the criterion function (continued)
- slink: single link
- wslink: weighted single link
- clink: complete link
- wclink: weighted complete link
- upgma: UPGMA
Other parameters
- cstype
- fulltree
- rowmodel, colmodel
- showfeatures

Criterion Functions for Document Clustering: Experiments and Analysis (2002)

by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455

Data Clustering

A.K. Jain (Michigan State University), M.N. Murty (Indian Institute of Science), and P.J. Flynn (The Ohio State University)

ACM Computing Surveys

Introduction (1/2): Clustering algorithms
- Agglomerative algorithms: UPGMA, single-link, complete-link, CURE, ROCK, Chameleon
- Partitional algorithms: K-means, K-medoids, Autoclass, graph-partitioning-based, spectral-partitioning-based; well suited to large datasets because they are fast
- Seven criterion functions: measure intra-cluster similarity, inter-cluster similarity, and two combinations of the two; i1, i2, e1, g1, g1p, h1, h2

Introduction (2/2): Datasets
- 15 different data sets

Preliminaries (1/3)
Document representation
- uses the vector space model for each document
- d: document; tf: term frequency; tfi: frequency of the i-th term in the doc
- uses idf or tf*idf weighting; N: total number of documents
Similarity measures: the similarity between two docs di, dj
- Cosine function; ||d||: length of the doc vector, used to normalize it
- 1: identical; 0: nothing in common
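The formulas on this slide were lost in extraction. As a hedged sketch, the code below assumes the common tf*idf weighting tf_i * log(N / df_i), where df_i (the number of docs containing term i) is a symbol the slide does not define, followed by scaling each vector to unit length so that the cosine reduces to a dot product. The function names and toy corpus are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, unit-length tf*idf vectors)."""
    N = len(docs)
    # df[t]: number of documents that contain term t (assumed definition).
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        v = [tf[t] * math.log(N / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v] if norm else v)
    return vocab, vectors

def cosine(a, b):
    # For unit-length vectors the cosine is simply the dot product.
    return sum(x * y for x, y in zip(a, b))

docs = [["web", "cluster", "cluster"], ["web", "cluster", "cluster"], ["web", "search"]]
vocab, vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # identical docs
print(cosine(vectors[0], vectors[2]))  # only the zero-weight term "web" is shared
```

Because "web" occurs in every document, its idf weight is log(1) = 0, so it contributes nothing to any similarity.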

Preliminaries (2/3)
Euclidean distance function
- if dis = 0, the docs are identical; if dis = $\sqrt{2}$ (for unit-length doc vectors), they have nothing in common
Definitions
- S: set of documents
- S1, S2, …, Sk: the sets of documents in each of the k clusters
- k: number of clusters
- n1, n2, …, nk: sizes (numbers of docs) of the corresponding clusters
- A: a set of docs
- composite vector DA: the sum of all doc vectors in A
- centroid vector CA: the average of the term weights of the docs in A
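The two definitions translate directly into code. This is a minimal sketch with illustrative names and toy numbers: the composite vector is the element-wise sum of the doc vectors in A, and the centroid is that sum divided by the number of docs.

```python
def composite(vectors):
    """Composite vector D_A: element-wise sum of all doc vectors in A."""
    return [sum(col) for col in zip(*vectors)]

def centroid(vectors):
    """Centroid vector C_A: average weight of each term over the docs in A."""
    return [x / len(vectors) for x in composite(vectors)]

A = [[1, 0, 2], [3, 4, 0]]  # two toy doc vectors over three terms
D_A = composite(A)
C_A = centroid(A)
print(D_A, C_A)
```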

Preliminaries (3/3): Vector properties
- Si, Sj: two sets of docs containing ni, nj documents
- Di, Dj: the composite vectors; Ci, Cj: the centroid vectors
- The sum of the pairwise similarities between the docs in Si and the docs in Sj is $D_i^t D_j$
- The sum of the pairwise similarities between the docs in Si is $\|D_i\|^2$
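Both properties can be checked numerically. The sketch below assumes unit-length doc vectors, so that the cosine similarity of two docs is their dot product, and counts every ordered pair (including a doc with itself) in the within-set sum. The vectors are toy values chosen for illustration.

```python
import math
from itertools import product

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def composite(vectors):
    return [sum(col) for col in zip(*vectors)]

# Unit-length toy vectors.
S_i = [[0.6, 0.8], [1.0, 0.0]]
S_j = [[0.0, 1.0], [0.8, 0.6]]

D_i, D_j = composite(S_i), composite(S_j)

# Sum of similarities over all pairs (one doc from S_i, one from S_j).
pairwise = sum(dot(a, b) for a, b in product(S_i, S_j))
# Sum of similarities over all ordered pairs within S_i, self-pairs included.
within = sum(dot(a, b) for a, b in product(S_i, S_i))

print(pairwise, dot(D_i, D_j))
print(within, dot(D_i, D_i))
```

This is why the criterion functions can be evaluated from composite vectors alone, without enumerating pairs of documents.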

Criterion Functions (1/5): Internal criterion functions
- I1: maximizes the sum of the average pairwise similarities between the docs assigned to each cluster; uses the cosine function. It is similar to the criterion of hierarchical agglomerative clustering that uses the group-average heuristic to decide which clusters to merge.
- I2: the criterion used by the vector-space variant of the K-means algorithm; uses the cosine function. Cr: centroid vector of cluster r.
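The formula images for I1 and I2 did not survive extraction. The sketch below assumes the closed forms given in the Zhao and Karypis paper for unit-length doc vectors, I1 = sum over r of ||D_r||^2 / n_r and I2 = sum over r of ||D_r||, where D_r is the composite vector of cluster r; treat them as recalled from the paper rather than read off the slide. Both are maximized, and the toy data is illustrative.

```python
import math

def composite(vectors):
    return [sum(col) for col in zip(*vectors)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def i1(clusters):
    """Assumed form: sum_r ||D_r||^2 / n_r (higher is better)."""
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)

def i2(clusters):
    """Assumed form: sum_r ||D_r|| (higher is better)."""
    return sum(norm(composite(c)) for c in clusters)

# Three unit-length docs; c is much closer to b than to a.
a, b, c = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
good = [[a], [b, c]]  # groups the two similar docs
bad = [[a, b], [c]]   # groups two orthogonal docs

print(i1(good), i1(bad))
print(i2(good), i2(bad))
```

Both functions score the grouping that keeps the similar docs together higher.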

Criterion Functions (2/5): External criterion functions (E1, E2)
- optimize a function based on how the clusters differ from each other
- the external function is derived by requiring the centroid vectors of the different clusters to be as orthogonal as possible
- C: the centroid vector of the entire collection
- D: the composite vector of the entire collection; 1/||D|| is constant
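The E1 formula itself is missing from the transcript. As a sketch, the code assumes the form given in the Zhao and Karypis paper once the constant 1/||D|| is dropped: E1 = sum over r of n_r * (D_r^t D) / ||D_r||, which is minimized. The helper names and toy data are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def composite(vectors):
    return [sum(col) for col in zip(*vectors)]

def e1(clusters):
    """Assumed form: sum_r n_r * (D_r . D) / ||D_r|| (lower is better)."""
    D = composite([doc for cluster in clusters for doc in cluster])
    total = 0.0
    for cluster in clusters:
        D_r = composite(cluster)
        total += len(cluster) * dot(D_r, D) / math.sqrt(dot(D_r, D_r))
    return total

# Three unit-length docs; c is much closer to b than to a.
a, b, c = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
good = [[a], [b, c]]
bad = [[a, b], [c]]
print(e1(good), e1(bad))
```

On this toy data the grouping that keeps the similar docs together gets the lower (better) value.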

Criterion Functions (3/5)
- The external criterion can also be defined with the Euclidean distance function.
Hybrid criterion functions (H1, H2)
- maximize the similarity of the docs in each cluster while minimizing the similarity between each cluster's docs and the entire collection
- H1: combines criterion functions I1 and E1

Criterion Functions (4/5)
- H2: combines criterion function I2 and E1
Graph-based criterion functions
- an alternative way to view the relations between docs is to use graphs
- G1: based on the pairwise similarities between the docs
- G2: based on the pairwise similarities between the docs and the terms
- S: the given collection of n docs; Gs: its similarity graph
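The hybrid functions H1 and H2 combine an internal criterion with E1. The sketch below assumes the ratio forms from the Zhao and Karypis paper, H1 = I1 / E1 and H2 = I2 / E1 (both maximized), together with the assumed closed forms of I1, I2, and E1 for unit-length doc vectors; none of these formulas is legible in the transcript itself.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def composite(vectors):
    return [sum(col) for col in zip(*vectors)]

def i1(clusters):
    return sum(dot(composite(c), composite(c)) / len(c) for c in clusters)

def i2(clusters):
    return sum(math.sqrt(dot(composite(c), composite(c))) for c in clusters)

def e1(clusters):
    D = composite([doc for cluster in clusters for doc in cluster])
    return sum(
        len(c) * dot(composite(c), D) / math.sqrt(dot(composite(c), composite(c)))
        for c in clusters
    )

def h1(clusters):
    """Assumed form: I1 / E1 (higher is better)."""
    return i1(clusters) / e1(clusters)

def h2(clusters):
    """Assumed form: I2 / E1 (higher is better)."""
    return i2(clusters) / e1(clusters)

# Three unit-length docs; c is much closer to b than to a.
a, b, c = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
good = [[a], [b, c]]
bad = [[a, b], [c]]
print(h1(good), h1(bad))
print(h2(good), h2(bad))
```

Dividing by E1 rewards solutions that are both internally tight and well separated from the collection as a whole.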

Criterion Functions (5/5)
- G1
- G2
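Only the labels G1 and G2 survive on this slide. As a hedged illustration of G1, the sketch assumes the form given in the Zhao and Karypis paper for unit-length doc vectors: the edge cut between each cluster and the rest of the collection, divided by the cluster's internal similarity, which reduces to G1 = sum over r of D_r^t (D - D_r) / ||D_r||^2 and is minimized. G2 needs the doc-term bipartite graph and is not sketched here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def composite(vectors):
    return [sum(col) for col in zip(*vectors)]

def g1(clusters):
    """Assumed form: sum_r D_r . (D - D_r) / ||D_r||^2 (lower is better)."""
    D = composite([doc for cluster in clusters for doc in cluster])
    total = 0.0
    for cluster in clusters:
        D_r = composite(cluster)
        rest = [x - y for x, y in zip(D, D_r)]  # composite of all other docs
        total += dot(D_r, rest) / dot(D_r, D_r)
    return total

# Three unit-length docs; c is much closer to b than to a.
a, b, c = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
good = [[a], [b, c]]
bad = [[a, b], [c]]
print(g1(good), g1(bad))
```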

Experimental Results: Direct k-way clustering

Data Sets
- 'The Natural Science' category in the Naver directory (http://dir.naver.com)
- 6 subcategories in the corpus
- 1,215 docs, 17,223 terms, 20 clusters, 5 features per doc, idf

Sub category     No. of docs
Physics          102
Earth science    149
Biology          426
Astronomy        323
Mathematics      102
Chemistry        113
Total            1,215

Experimental parameters: Algorithms
- rb, rbr: k-1 repeated bisections (rbr additionally optimizes the criterion function)
- direct: computed by simultaneously finding all k clusters
- agglo: the agglomerative paradigm
- graph: using a nearest-neighbor graph

Experimental parameters
- Criterion functions: i1, i2, e1, g1, g1p, h1, h2, clink, slink
- Similarity function: cosine measure

Experimental results: Entropy

Criterion   rb     rbr    direct   agglo   graph
I1          .464   .452   .490     .642    -
I2          .379   .375   .374     .564    -
E1          .388   .398   .416     .540    -
G1          .389   .418   .398     .895    -
G1p         .326   .366   .391     .562    -
H1          .386   .392   .386     .541    -
H2          .348   .352   .367     .559    -
clink       -      -      -        .761    -
slink       -      -      -        .895    -
cut         -      -      -        -       .417

[Figure: Entropy (axis from 0 to 0.9) by clustering method (rb, rbr, direct, agglo, graph), one series per criterion function: I1, I2, E1, G1, G1p, H1, H2, clink, slink]

Experimental results: Purity

Criterion       rb     rbr    direct   agglo   graph
I1              .686   .690   .683     .548    -
I2              .772   .762   .761     .629    -
E1              .741   .737   .723     .647    -
G1              .768   .739   .752     .367    -
G1p             .780   .758   .758     .647    -
H1              .753   .744   .758     .634    -
H2              .780   .782   .751     .650    -
clink           -      -      -        .458    -
slink           -      -      -        .368    -
cut functions   -      -      -        -       .749
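The slides report entropy and purity without defining them. The sketch below assumes the definitions commonly used in the document clustering literature, including the Zhao and Karypis paper: per-cluster entropy is normalized by log q (q is the number of classes, assumed to be at least 2), per-cluster purity is the share of the dominant class, and both are averaged with weights proportional to cluster size. Lower entropy and higher purity are better. The labels and assignments are toy values.

```python
import math
from collections import Counter

def entropy_and_purity(labels, assignments):
    """labels[i]: true class of doc i; assignments[i]: cluster of doc i."""
    n = len(labels)
    q = len(set(labels))  # number of classes, assumed >= 2
    clusters = {}
    for label, cluster in zip(labels, assignments):
        clusters.setdefault(cluster, Counter())[label] += 1
    entropy = purity = 0.0
    for counts in clusters.values():
        n_r = sum(counts.values())
        # Entropy of the class distribution inside this cluster, scaled to [0, 1].
        cluster_entropy = -sum(
            (c / n_r) * math.log(c / n_r) for c in counts.values()
        ) / math.log(q)
        entropy += (n_r / n) * cluster_entropy
        purity += (n_r / n) * (max(counts.values()) / n_r)
    return entropy, purity

labels = ["physics", "physics", "biology", "biology"]
perfect = entropy_and_purity(labels, [0, 0, 1, 1])  # each cluster holds one class
mixed = entropy_and_purity(labels, [0, 1, 0, 1])    # each cluster is half and half
print(perfect, mixed)
```

Read against these definitions, an entropy near 0.9 with purity near 0.37 (as for slink) indicates clusters that barely separate the six subcategories.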

[Figure: Purity (axis from 0 to 0.8) by clustering method (rb, rbr, direct, agglo, graph), one series per criterion function: I1, I2, E1, G1, G1p, H1, H2, clink, slink]

Best results

Method   Criterion   Entropy   Purity
rb       g1p         0.326     0.780
rbr      h2          0.352     0.782
direct   h1          0.386     0.758
agglo    h1          0.541     0.634
graph    cut         0.417     0.749