Web Page Clustering based on Web Community Extraction Chikayama-Taura Lab. M2 Shim Wonbo


Post on 26-Dec-2015




Background

Directory = Category


Open Directory Project
  Used by Google, Lycos, etc.
  Categorizes Web pages by hand
    Accurate
    Slow to update
    Unscalable


World Wide Web
  Rapid growth (= the number of clusters changes)
  Daily updates (= cluster centers move)

Because of these two properties of the Web, a Web page clustering system that needs no human effort is required.


Purpose
  Construct a Web page clustering system which
    finds clusters without human help
    is scalable
    clusters Web pages at high speed
    clusters Web pages accurately


Brief System View

(a) Web pages -> [DBG extraction] -> (b) Web communities -> [partitioning of remaining pages based on TF-IDF] -> (c) Web page clustering


Contribution
  Web Community
    A new Web community topology is defined.
    The extracted Web communities show higher precision than existing work.
  Web Page Clustering
    Web communities are used as centroids of clusters in TF-IDF space.
    Experimental results show meaningful clusters.


Agenda
  Introduction
  Related Work
  Proposal
  Evaluation
  Conclusion


Existing Work
  Text-based clustering
    Uses terms as features
    Commonly used algorithms, e.g. k-means, hierarchical clustering, density-based clustering
  Link-based clustering
    Also called Web community extraction
    Extracts dense subgraphs from the Web graph
  Combination of text and link information
    e.g. Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004]


Text-based Clustering
  Merit
    Accurate (because the text is considered)
  Problem
    Unsupervised clustering: it is hard to decide the number of clusters
    Supervised learning and clustering: it is difficult to label each training datum


Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004]
  Features: term frequency (p_term), out-links (p_out), in-links (p_in)
  Similarity
  Clustering algorithm: an extension of the k-means algorithm


Extraction of Web Community based on Link Analysis
  An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001]
  PlusDBG: Web Community Extraction Scheme Improving Both Precision and Pseudo-Recall [Saida et al., 2005]


Terminology
  Fan and Center
  Bipartite Graph (BG)
  Complete BG (CBG)
  Dense BG (DBG)

[Figure: (a) a CBG and (b) a DBG, each drawn with fans on one side and centers on the other, annotated with the parameters p and q]
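
To make the terminology concrete, here is a minimal sketch of a DBG membership test. It assumes the definition usually attributed to Reddy et al.: in a DBG(p, q), every fan cites at least p of the centers and every center is cited by at least q of the fans. The slides do not spell this out, so treat it as an assumption. The function name `is_dbg`, the adjacency-dict representation, and the toy graph are hypothetical.

```python
def is_dbg(out_links, fans, centers, p, q):
    """Return True if (fans, centers) forms a DBG(p, q) under the assumed definition."""
    fans, centers = set(fans), set(centers)
    # Each fan must cite at least p of the centers.
    for f in fans:
        if len(out_links.get(f, set()) & centers) < p:
            return False
    # Each center must be cited by at least q of the fans.
    for c in centers:
        cited_by = {f for f in fans if c in out_links.get(f, set())}
        if len(cited_by) < q:
            return False
    return True


# Hypothetical toy graph: three fans citing three centers.
toy_links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2"},
    "f3": {"c2", "c3"},
}
loose = is_dbg(toy_links, ["f1", "f2", "f3"], ["c1", "c2", "c3"], p=2, q=2)
strict = is_dbg(toy_links, ["f1", "f2", "f3"], ["c1", "c2", "c3"], p=3, q=3)
print(loose, strict)
```

Under this reading a CBG is the special case where p equals the number of centers and q equals the number of fans, which is why the toy graph passes the loose thresholds but not the strict ones.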


Algorithm for Extracting DBG [Reddy et al., 2001]
  Finds a bipartite graph using co-citing and co-cited Web pages
  Extracts a DBG from that graph

[Figure: example starting from a seed page, with per-node link counts, from which a DBG(3, 3) is extracted]
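
One simple way to obtain a DBG(p, q) from a candidate bipartite graph is iterative pruning: repeatedly drop fans with fewer than p links into the remaining centers and centers with fewer than q links from the remaining fans. The sketch below illustrates that idea only. Whether it matches the procedure of Reddy et al. in detail is an assumption, and `prune_to_dbg` and the toy graph are hypothetical.

```python
def prune_to_dbg(out_links, fans, centers, p, q):
    """Shrink (fans, centers) until every fan has >= p out-links and every center >= q in-links."""
    fans, centers = set(fans), set(centers)
    changed = True
    while changed:
        changed = False
        # Drop fans that cite too few of the remaining centers.
        weak_fans = {f for f in fans if len(out_links.get(f, set()) & centers) < p}
        if weak_fans:
            fans -= weak_fans
            changed = True
        # Drop centers that are cited by too few of the remaining fans.
        weak_centers = {
            c for c in centers
            if sum(1 for f in fans if c in out_links.get(f, set())) < q
        }
        if weak_centers:
            centers -= weak_centers
            changed = True
    return fans, centers


# Hypothetical candidate graph: f4 and c4 are only loosely attached.
candidate = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "c4"},
    "f4": {"c4"},
}
dense_fans, dense_centers = prune_to_dbg(
    candidate, ["f1", "f2", "f3", "f4"], ["c1", "c2", "c3", "c4"], p=3, q=3
)
print(sorted(dense_fans), sorted(dense_centers))
```

The loop stops when a full pass removes nothing, so the result is either a subgraph that satisfies both thresholds or two empty sets.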


PlusDBG
  Uses a distance between two pages defined by their rate of co-citing pages
  Finds co-citing pages that are within a distance threshold
  Extracts a DBG from that graph

PlusDBG shows higher precision than DBG does.


Web Community Extraction
  O High speed
  O Finds topics across the Web
  X May extract unrelated Web pages as a community


Problem of DBG


Improvement of PlusDBG


Agenda
  Introduction
  Related Work
  Proposal
  Evaluation
  Conclusion


Proposal
  1. Extract Web communities using the link structure.
  2. Assign the remainders to the closest Web community in TF-IDF space.


Proposed Web Community
  Connecter
    A fan that cites two centers.
  Connectable
    Two centers are connectable if they have more than two connecters.
  Web Community
    A Web community C is a DBG composed of connectable centers and their connecters.

[Figure: connectable centers and a connecter]
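
The two definitions can be sketched in a few lines. The code assumes the Web graph is given as a dict from each page to the set of pages it cites, and reads "more than two connecters" as "at least three". The function names and the toy links are hypothetical.

```python
def connecters(out_links, center_a, center_b):
    """Fans that cite both centers."""
    return {
        page for page, cited in out_links.items()
        if center_a in cited and center_b in cited
    }


def connectable(out_links, center_a, center_b, min_connecters=3):
    """Two centers are connectable when they share more than two connecters."""
    return len(connecters(out_links, center_a, center_b)) >= min_connecters


# Hypothetical links: x and y share three fans, x and z share only one.
links = {
    "s1": {"x", "y"},
    "s2": {"x", "y"},
    "s3": {"x", "y", "z"},
}
print(sorted(connecters(links, "x", "y")))
print(connectable(links, "x", "y"), connectable(links, "x", "z"))
```

Scanning every page for each pair is fine for a toy example. On a graph of millions of pages an inverted index from each center to its citing pages would be the more practical choice.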


Proposed Web Community
  Every center is connectable to another one.


Extraction Algorithm

[Figure: example Web graph with pages a to j]

  S = {}, T = {g}
  S’ = {a,b,c,d}, T’ = {e,f,h,i,j}: t’ = j, # connecters = 1
  T’ = {e,f,h,i}: t’ = i, # connecters = 3
  S = {b,c,d}, T = {g,i}
  Output community = {a,b,c,d,e,f,g,h,i}


Labeling Remainders
  Remainder: a Web page which is not extracted as a member of any community.

  1. Calculate the centroids of the Web communities.
  2. Label each remainder with the ID of the closest Web community, where v_i is the TF-IDF vector of a page v.
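
The two labeling steps amount to nearest-centroid assignment. The sketch below is an illustration under stated assumptions. The centroid is taken as the mean of the members' TF-IDF vectors, and cosine distance is used as the distance, since the transcript does not preserve the measure from the slide. Sparse vectors are plain dicts, and all names and toy weights are hypothetical.

```python
import math


def centroid(vectors):
    """Mean of sparse TF-IDF vectors given as {term: weight} dicts."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def label_remainder(page_vec, centroids):
    """Return the ID of the community whose centroid is closest to the page."""
    return min(centroids, key=lambda cid: cosine_distance(page_vec, centroids[cid]))


# Hypothetical communities with made-up TF-IDF weights.
communities = {
    "bikes": [{"bike": 1.0, "engine": 0.5}, {"bike": 0.8}],
    "films": [{"movie": 1.0}, {"movie": 0.6, "actor": 0.4}],
}
centroids = {cid: centroid(vecs) for cid, vecs in communities.items()}
label = label_remainder({"engine": 0.7, "bike": 0.2}, centroids)
print(label)
```

Swapping in Euclidean distance only requires replacing `cosine_distance`. The rest of the procedure stays the same.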


Agenda
  Introduction
  Related Work
  Proposal
  Evaluation
    Preprocess
    Web community extraction
    Labeling result
  Conclusion


Preprocess
  Data set
    2.34 M pages, 20 M links
    Almost 80% of the data set is Japanese pages.
  Create a link-only file
    Links pointing outside the data set are deleted.
    Duplicates that share 90% of their links are deleted.
    Pages including 50 links are deleted.
    Remaining data set: 1.45 M pages, 5.09 M links
  Create a TF-IDF file
    Used TF-IDF:
    Parser: MeCab
    Terms that appear in less than 0.1% or more than 90% of all documents are removed.
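
The term-filtering rule can be illustrated with a short sketch. It assumes the documents are already tokenized (the slides use MeCab for that step). It also uses the common weighting tf * log(N / df), which is an assumption because the formula on the slide is not preserved in the transcript. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs, min_df=0.001, max_df=0.90):
    """Drop terms outside [min_df, max_df] document frequency, then weight by tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = {t for t, count in df.items() if min_df <= count / n_docs <= max_df}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vocab, vectors


# Hypothetical tokenized pages; "the" appears everywhere and is filtered out.
docs = [
    ["bike", "engine", "the"],
    ["bike", "the"],
    ["movie", "the"],
    ["movie", "actor", "the"],
]
vocab, vectors = tf_idf_vectors(docs)
print(sorted(vocab))
print(vectors[0])
```

With only four documents the lower bound of 0.1% never triggers. On a corpus the size of the one described above it would remove terms that occur in only a handful of pages.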


Distribution of Web Community Size


Distribution of Web Community Size

Method            # communities   # extracted pages
PlusDBG 0.8       22,902          865,945
PlusDBG 1.0       8,077           922,053
PlusDBG 1.2       7,527           923,100
Proposed method   50,065          648,626


Distance from centroids to term vectors


Variance of distance


Example of Web communities
  About motorbike manufacturers and links.

  http://bike.ak-m.jp/
  http://www.bike-cube.jp/
  http://bike.ak-m.jp/2006/01/post_32.html
  http://www.bike-cube.jp/index.php
  http://bike.ak-m.jp/2006/11/post_20.html
  http://www.kymco.co.jp/
  http://www1.suzuki.co.jp/motor/
  http://www.yamaha-motor.jp/mc/
  http://bike.ak-m.jp/
  http://www.peugeot-moto.com/
  http://www.apriliajapan.co.jp/index.html
  http://www.buell.jp/
  http://www.cagiva.co.jp/
  http://www.mitsuoka-motor.com/
  http://www.ducati.com/od/ducatijapan/jp/index.jhtml
  http://www.triumphmotorcycles.com/japan/
  http://www.harley-davidson.co.jp/index.html
  http://www.ktm-japan.co.jp/


Comparing to ODP
  Definition of precision
    1. For a Web community C, let O_C be the subset of its pages that exist in ODP.
    2. If |O_C| < 3, the precision of C is undefined.
    3. For r in O_C, the Pscore of r is:
    4. With Pscore, the precision of C is:

    score(p, q) = 1 if p and q are in the same directory
    score(p, q) = 0 otherwise

  Compared against the 4th and 5th levels of ODP directories (e.g. Top/Regional/Japan/Arts/Movie)
  The number of ODP pages included in the data set: 47,093
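
The Pscore and precision formulas themselves are not preserved in the transcript, so the sketch below shows only one plausible reading. It takes Pscore(r) as the average of score(r, q) over the other pages q in O_C, and the precision of C as the average Pscore over O_C. Both choices are assumptions, as are the function name and the toy directory assignments.

```python
def community_precision(community, odp_directory):
    """Precision of one community under the assumed Pscore definition.

    community: iterable of unique page IDs.
    odp_directory: dict mapping an ODP page to its directory (at a fixed level).
    Returns None when fewer than three pages of the community are in ODP.
    """
    odp_pages = [page for page in community if page in odp_directory]
    if len(odp_pages) < 3:
        return None  # precision is undefined

    def score(p, q):
        return 1 if odp_directory[p] == odp_directory[q] else 0

    pscores = []
    for r in odp_pages:
        others = [q for q in odp_pages if q != r]
        pscores.append(sum(score(r, q) for q in others) / len(others))
    return sum(pscores) / len(pscores)


# Hypothetical ODP assignments: three pages share a directory, one does not.
odp = {"p1": "Arts/Movie", "p2": "Arts/Movie", "p3": "Arts/Movie", "p4": "Sports"}
precision = community_precision(["p1", "p2", "p3", "p4", "p5"], odp)
too_small = community_precision(["p1", "p2", "p5"], odp)
print(precision, too_small)
```

Under this reading a community whose ODP pages all share one directory scores 1.0, and a community whose ODP pages are all in different directories scores 0.0.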


Comparing to ODP

Method            # ODP pages   # communities including ODP pages   # directories the pages belong to
PlusDBG 0.8       23,287        459                                 426
PlusDBG 1.0       25,016        156                                 430
PlusDBG 1.2       25,405        81                                  435
Proposed method   12,406        4,811                               337


Precision of Web Communities(4th level)


Precision of Web communities(5th level)


Summary of Web Community Extraction
  The proposed method extracted smaller Web communities than PlusDBG did.
  Members of each community were closer to the centroid in TF-IDF space than members of PlusDBG communities were.
  My communities showed higher precision than PlusDBG’s when compared with ODP.


Labeling Result
  Pages with fewer than 10 terms are ignored.
  Compared with ODP
    ODP pages: 29,153
    ODP directories: 1,862


Labeling Result (the 4th level)


Labeling Result (the 5th level)


Labeling example


Labeling example


Summary and Conclusion
  A DBG structure is defined as the Web community topology.
    Every pair of centers must be connectable.
    Every fan is a connecter of centers.
    My DBG structure extracts more compact and more precise Web communities than existing work does.
  Clustering based on Web community extraction is proposed.
    The centroids of communities in TF-IDF space are used in labeling the remainders.
    The clustering result showed meaningful page groups.


Future Work
  Coupling with feature selection to improve the labeling result.
  Clustering the extracted centroids.


Publications (planned)
  "Improvement of a Web Community Extraction Algorithm", Shim Wonbo, Kenjiro Taura, Takashi Chikayama, Data Engineering Workshop (DEWS), 2007


Thank you for your attention


Extraction Algorithm
  1. Select a seed page t and set T = {t}, S = {}.
  2. Find S’, the pages that cite any page in T.
  3. Find T’, the pages that are cited by any page in S’ and are not in T.
  4. Determine whether t’ ∈ T’ is connectable to all pages in T.
     4.1. If t’ is connectable, set T = T ∪ {t’} and S = {connecters}, and go to 2.
     4.2. If not, select another t’ ∈ T’ and go to 4.
  5. If |S| > 3 and |T| > 3, extract the page set as a Web community and delete it from the Web graph.
  6. If any seed page t remains, go to 1.
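
The steps above can be sketched for a single seed page as follows. This is an illustration under several assumptions, not the author's implementation. Candidate centers T’ are taken from the pages cited by the candidate fans S’, "connectable" means sharing at least three connecters with every current center, and S is kept as the union of all connecters found so far. The function name and the toy graph are hypothetical. The graph is made up to resemble the worked example, not taken from the slide's figure.

```python
from collections import defaultdict


def grow_community(out_links, seed, min_connecters=3):
    """Grow centers T and fans S from one seed page (assumed reading of steps 1 to 4)."""
    # Invert the link map so that in_links[p] is the set of pages citing p.
    in_links = defaultdict(set)
    for src, dsts in out_links.items():
        for dst in dsts:
            in_links[dst].add(src)

    centers = {seed}  # T
    fans = set()      # S
    grown = True
    while grown:
        grown = False
        # Step 2: S' = pages citing any current center.
        cand_fans = set().union(*(in_links[t] for t in centers))
        # Step 3: T' = pages cited by S' that are not yet centers.
        cand_centers = set().union(*(out_links.get(s, set()) for s in cand_fans)) - centers
        # Step 4: accept the first candidate that is connectable to every current center.
        for t_new in sorted(cand_centers):
            shared = [in_links[t_new] & in_links[t] for t in centers]
            if all(len(common) >= min_connecters for common in shared):
                centers.add(t_new)
                fans.update(*shared)
                grown = True
                break
    return fans, centers


# Hypothetical graph: fans a to d, candidate centers e to j.
toy = {
    "a": {"e", "f", "g"},
    "b": {"e", "f", "g", "h", "i"},
    "c": {"e", "f", "g", "h", "i"},
    "d": {"e", "f", "g", "h", "i", "j"},
}
fans, centers = grow_community(toy, seed="g")
# Step 5: keep the result only if both sides have more than three pages.
is_community = len(fans) > 3 and len(centers) > 3
print(sorted(fans), sorted(centers), is_community)
```

Page j is rejected because only one fan cites both j and g. Step 6 would then remove the extracted pages from the graph and repeat from another seed, which is omitted here for brevity.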