An Improved Fuzzy System for Representing Web Pages in Clustering Tasks PhD Thesis Alberto Pérez García-Plaza UNED – NLP & IR group October 23, 2012 Advisors: Raquel Martínez Unanue Víctor Fresno Fernández


DESCRIPTION

Slides for my PhD Thesis defense http://www.garciaplaza.com/phd.php

TRANSCRIPT

Page 1: An improved fuzzy system for representing web pages in Clustering Tasks

An Improved Fuzzy System for Representing Web Pages in Clustering Tasks

PhD Thesis
Alberto Pérez García-Plaza

UNED – NLP & IR group
October 23, 2012

Advisors: Raquel Martínez Unanue, Víctor Fresno Fernández

Page 2: An improved fuzzy system for representing web pages in Clustering Tasks

Table of Contents

1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications

Page 3: An improved fuzzy system for representing web pages in Clustering Tasks

Table of Contents

1. Introduction
   1. Motivation
   2. Objectives
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications

Page 4: An improved fuzzy system for representing web pages in Clustering Tasks

Motivation
• Document clustering is the grouping of documents based only on the documents themselves.

WEB PAGE CLUSTERING

Page 5: An improved fuzzy system for representing web pages in Clustering Tasks

Motivation
• Document representation plays a key role in clustering.
• Representation comes first.
• We focus on Document Representation
• Characteristics employed…
• …and the way of using them.

Page 6: An improved fuzzy system for representing web pages in Clustering Tasks

State of the Art
• TF-IDF is a de facto standard.
• Combination of criteria:
  • Linear approaches.
  • Algorithm.
• Hyperlinks.
• Datasets for evaluation differ from one work to another.

Page 7: An improved fuzzy system for representing web pages in Clustering Tasks

Web Page Example
• Our criteria:

Page 8: An improved fuzzy system for representing web pages in Clustering Tasks

Web Page Example
• Our criteria: Title

Page 9: An improved fuzzy system for representing web pages in Clustering Tasks

Web Page Example
• Our criteria: Emphasis

Page 10: An improved fuzzy system for representing web pages in Clustering Tasks

Web Page Example
• Our criteria: Frequency

Page 11: An improved fuzzy system for representing web pages in Clustering Tasks

Web Page Example
• Our criteria: Position (Standard, Preferential)

Page 12: An improved fuzzy system for representing web pages in Clustering Tasks

Combining Criteria
• Linear Combination of Criteria:
  • Each criterion is multiplied by a constant.
  • Constants try to establish the importance of each criterion.

What’s the problem?
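
A minimal Python sketch of the linear combination described above. The weights and input values are hypothetical and chosen only for illustration; the slides do not give the actual constants.

```python
def linear_importance(criteria, weights):
    """Weighted sum of criterion values for one term."""
    return sum(weights[name] * value for name, value in criteria.items())

# Hypothetical constants, one per criterion (not taken from the slides).
WEIGHTS = {"title": 0.4, "emphasis": 0.3, "frequency": 0.2, "position": 0.1}

# A term that appears only in the title.
title_only = {"title": 1.0, "emphasis": 0.0, "frequency": 0.0, "position": 0.0}
# A term that is moderately supported by the other three criteria.
body_term = {"title": 0.0, "emphasis": 0.5, "frequency": 0.5, "position": 0.5}

score_title_only = linear_importance(title_only, WEIGHTS)
score_body_term = linear_importance(body_term, WEIGHTS)
```

With these assumed weights the title-only term scores higher than the body term, because each criterion contributes independently of the others.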

Page 13: An improved fuzzy system for representing web pages in Clustering Tasks

Combining Criteria

Call to Arms

Page 14: An improved fuzzy system for representing web pages in Clustering Tasks

Combining Criteria


Title terms not related to the theme of the document.

Page 15: An improved fuzzy system for representing web pages in Clustering Tasks

Combining Criteria
• Need:
  • Related conditions to establish word importance.
• Fuzzy Logic because:
  • Declare the knowledge without specifying the calculation.
  • Rules close to natural language (IF - THEN).
  • Relations among criteria.
  • Ease the task of expressing heuristic knowledge.
• Other kinds of systems require additional effort to understand how they work before they can be modified.

Page 16: An improved fuzzy system for representing web pages in Clustering Tasks

Problem Statement

To study and improve a web page (1) representation based on fuzzy logic (2) applied to clustering tasks.

(1) HTML documents.
(2) FCC, Víctor Fresno, PhD Thesis (2006).

Page 17: An improved fuzzy system for representing web pages in Clustering Tasks

Objectives

1. Compare the fuzzy system with TF-IDF as standard method and different dimension reduction methods.
2. Analyze an existing fuzzy combination of criteria (FCC).
3. Assess the possibility of adding new criteria beyond document contents.
4. Adjust the representation to concrete datasets.
5. Evaluate our proposals in hierarchical clustering.
6. Evaluate our methods in more than one language.

Page 18: An improved fuzzy system for representing web pages in Clustering Tasks

Table of Contents

1. Introduction
2. Web Page Representation and Fuzzy Logic
   1. Dimension Reduction
   2. Criteria Analysis
   3. New Criteria
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications

Page 19: An improved fuzzy system for representing web pages in Clustering Tasks

Representation

Web Page Representation

Term Weighting

Dimension Reduction

Clustering

Evaluation

Page 20: An improved fuzzy system for representing web pages in Clustering Tasks

Overview of the Fuzzy System

Knowledge base

Page 22: An improved fuzzy system for representing web pages in Clustering Tasks

Datasets

Dataset      # Documents   # Categories                    Language            Hierarchical
Banksearch   9,897         10*                             English             No
WebKB        4,518         6                               English             No
SODP         12,148        17                              English             No
WAD          166           4 (1st level), 17 (2nd level)   English & Spanish   Yes

(*) approx. same number of documents within each category.

Page 23: An improved fuzzy system for representing web pages in Clustering Tasks

Basic Clustering Settings
• Stop words removal & Stemming (Porter).
• Cluto-rbr with default parameters.
• Initial Weighting Functions: TF-IDF and FCC.
• Dimension Reduction Methods (100, 500, 1000, 2000, 5000 features): DF, LSI, RP, MFT.
• F-measure to evaluate clustering quality.
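
The F-measure mentioned above can be computed in several ways for clustering. The sketch below uses one common class-weighted, best-match definition; whether this is the exact variant used in these experiments is an assumption, and the function name is hypothetical.

```python
from collections import Counter

def clustering_f_measure(labels_true, labels_pred):
    """For each true class take the best F1 over all clusters,
    then average those values weighted by class size."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total

# Toy labels, for illustration only.
perfect = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
single_cluster = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
```

A perfect partition yields 1.0, while putting every document in one cluster is penalised through precision.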

Page 24: An improved fuzzy system for representing web pages in Clustering Tasks

MFT Reduction

Our proposal for dimension reduction:

• Rank terms within each document:
  1. Soccer 2. Goal 3. Referee 4. …
  1. Music 2. Show 3. Band 4. …
  1. Goal 2. Ball 3. Soccer 4. …
  1. Music 2. Band 3. Album 4. …
• Selected terms: Music Soccer Goal Show Band Ball
• …until the desired number of terms is reached.
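
The selection procedure illustrated above can be sketched as follows. This is an interpretation of the slide, not the thesis implementation: it walks the per-document rankings position by position and, within each position, prefers terms shared by more documents. The function name and the tie-breaking rule (first-seen order) are assumptions, so the order of tied terms may differ from the slide.

```python
from collections import Counter

def mft_select(rankings, n_terms):
    """Pick up to n_terms features from per-document term rankings."""
    if n_terms <= 0:
        return []
    selected = []
    seen = set()
    depth = max((len(r) for r in rankings), default=0)
    for level in range(depth):
        # Terms found at this rank position across all documents.
        counts = Counter(r[level] for r in rankings if level < len(r))
        # most_common() sorts by count; ties keep first-seen order.
        for term, _ in counts.most_common():
            if term not in seen:
                seen.add(term)
                selected.append(term)
                if len(selected) == n_terms:
                    return selected
    return selected

# Rankings copied from the slide example (lowercased).
rankings = [
    ["soccer", "goal", "referee"],
    ["music", "show", "band"],
    ["goal", "ball", "soccer"],
    ["music", "band", "album"],
]
features = mft_select(rankings, 6)
```

On the slide example this selects the same six terms as the slide, starting with music, soccer and goal.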

Page 25: An improved fuzzy system for representing web pages in Clustering Tasks

Dimension Reduction Experiments

Comparison: MFT Vs. LSI
Both methods over TF-IDF and FCC

Dataset      Representation   Avg. F   S. D.
Banksearch   TF-IDF MFT       0.748    0.028
Banksearch   TF-IDF LSI       0.756    0.005
Banksearch   FCC MFT          0.756    0.019
Banksearch   FCC LSI          0.769    0.011
WebKB        TF-IDF MFT       0.460    0.051
WebKB        TF-IDF LSI       0.507    0.006
WebKB        FCC MFT          0.469    0.009
WebKB        FCC LSI          0.466    0.011

Page 27: An improved fuzzy system for representing web pages in Clustering Tasks

Dimension Reduction Experiments

Comparison: MFT Vs. LSI

• LSI outperforms MFT.
• FCC and TF-IDF are not working as well as they could.
• FCC in WebKB obtains bad results, even with LSI.

Page 28: An improved fuzzy system for representing web pages in Clustering Tasks

Analysis of the Combination

Banksearch

Rep.\Dim.   100     500     1000    2000    5000
FCC MFT     0.723   0.757   0.768   0.765   0.768
title       0.626   0.646   0.632   0.634   0.639
emphasis    0.586   0.671   0.674   0.685   0.693
frequency   0.689   0.715   0.720   0.724   0.731
position    0.310   0.525   0.538   0.599   0.608

The combination always outperforms individual criteria.

Frequency seems to be the best among individual criteria.

Page 31: An improved fuzzy system for representing web pages in Clustering Tasks

Analysis of the Combination

WebKB

Rep.\Dim.   100     500     1000    2000    5000
FCC MFT     0.453   0.472   0.475   0.468   0.475
title       0.432   0.433   0.404   0.488   0.479
emphasis    0.415   0.431   0.433   0.465   0.489
frequency   0.441   0.460   0.460   0.468   0.446
position    0.301   0.283   0.317   0.281   0.286

The combination does not always outperform the others.

Frequency is not always the best among individual criteria.

When title and emphasis could lead to a better clustering, the combination gets worse.

Page 35: An improved fuzzy system for representing web pages in Clustering Tasks

Analysis of the Combination
• Position is considered more decisive than others.
• But position empirically got the worst results.
• Its heuristics are based on written texts and not on web pages.
• Sample rule:
  IF title IS low AND frequency IS medium AND emphasis IS high AND position IS preferential THEN importance IS very high
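
To make the sample rule concrete, here is a minimal sketch of how its firing strength could be computed. The triangular membership functions, their breakpoints, the [0, 1] input scale and the use of min for AND are all assumptions made for illustration; the slides do not specify them.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def sample_rule_strength(title, frequency, emphasis, position_preferential):
    """Degree to which the sample rule fires for one term.

    Inputs are assumed to be normalised to [0, 1];
    position_preferential is taken directly as a membership degree.
    """
    degrees = [
        triangular(title, -0.5, 0.0, 0.5),     # title IS low
        triangular(frequency, 0.0, 0.5, 1.0),  # frequency IS medium
        triangular(emphasis, 0.5, 1.0, 1.5),   # emphasis IS high
        position_preferential,                 # position IS preferential
    ]
    # AND is modelled as the minimum of the antecedent degrees (assumption).
    return min(degrees)

full_match = sample_rule_strength(0.0, 0.5, 1.0, 1.0)
partial_match = sample_rule_strength(0.25, 0.5, 0.75, 1.0)
no_match = sample_rule_strength(0.6, 0.5, 1.0, 1.0)
```

The resulting strength would then scale the "importance IS very high" consequent; aggregation over the whole rule base and defuzzification are left out of this sketch.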

Page 36: An improved fuzzy system for representing web pages in Clustering Tasks

EFCC Rule Base

Page 37: An improved fuzzy system for representing web pages in Clustering Tasks

EFCC Rule Base

Title, Emphasis and Position + Frequency

Page 38: An improved fuzzy system for representing web pages in Clustering Tasks

EFCC Experiments

Comparison: EFCC Vs. FCC & TF-IDF

Dataset      Representation   Avg. F   S. D.
Banksearch   TF-IDF LSI       0.756    0.005
Banksearch   FCC LSI          0.769    0.011
Banksearch   EFCC MFT         0.760    0.014
Banksearch   EFCC LSI         0.758    0.013
WebKB        TF-IDF LSI       0.507    0.006
WebKB        FCC LSI          0.469    0.011
WebKB        EFCC MFT         0.532    0.032
WebKB        EFCC LSI         0.483    0.000

Page 41: An improved fuzzy system for representing web pages in Clustering Tasks

EFCC Experiments

Comparison: EFCC Vs. FCC & TF-IDF

• Banksearch: with EFCC both reductions get similar results.
• WebKB: EFCC seems to solve the problems of FCC.
• EFCC with MFT seems to be a good alternative to TF-IDF with LSI.

MFT is cheaper than LSI

Page 42: An improved fuzzy system for representing web pages in Clustering Tasks

Criteria Beyond the Document
• Add IDF to EFCC:

Comparison: EFCC Vs. EFCC-IDF

Dataset      Representation   Avg. F   S. D.
Banksearch   EFCC MFT         0.760    0.014
Banksearch   EFCC-IDF MFT     0.749    0.129
WebKB        EFCC MFT         0.532    0.032
WebKB        EFCC-IDF MFT     0.350    0.070

• EFCC-IDF does not work in WebKB.
• IDF strongly affects EFCC.
• WebKB unbalanced categories.

Page 43: An improved fuzzy system for representing web pages in Clustering Tasks

Criteria Beyond the Document
• Add information from Anchor Texts:
• We collect up to 300 unique inlinks for each SODP page (~ 1M).
• Two experiments:
  (a) Anchors as plain text.
  (b) Anchors as titles.
• Three alternatives for each experiment:
  (1) Just adding anchors.
  (2) Removing outlinks.
  (3) Removing stopwords.

Page 44: An improved fuzzy system for representing web pages in Clustering Tasks

Criteria Beyond the Document

EFCC Vs. EFCC + Anchor texts

Dataset   Representation   Avg. F   S. D.
SODP      FCC MFT          0.242    0.028
SODP      EFCC MFT         0.275    0.025
SODP      EFCC a-1 MFT     0.268    0.027
SODP      EFCC a-2 MFT     0.267    0.024
SODP      EFCC a-3 MFT     0.276    0.022
SODP      EFCC b-1 MFT     0.277    0.015
SODP      EFCC b-2 MFT     0.270    0.016
SODP      EFCC b-3 MFT     0.267    0.012

(a) Anchors as plain text.
(b) Anchors as titles.

(1) Just adding anchors.
(2) Removing outlinks.
(3) Removing stopwords.

Page 46: An improved fuzzy system for representing web pages in Clustering Tasks

Criteria Beyond the Document

EFCC Vs. EFCC + Anchor texts

• The best case using anchor texts gets results similar to EFCC MFT.
• Computational cost.
• FCC performs worse than EFCC in this collection also.

Page 47: An improved fuzzy system for representing web pages in Clustering Tasks

Table of Contents

1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
   1. Analyze Data Distributions
   2. Tune Membership Functions
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications

Page 48: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation
• Some dataset characteristics could influence the way of defining the Fuzzy Rule Based System.
• Document information is captured by means of membership functions.
• Should these functions be modified depending on the dataset?

Example of membership functions associated with the Frequency linguistic variable.

Page 49: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation
• The inputs are frequency values in different criteria.

Frequency

Page 50: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation
• The inputs are frequency values in different criteria.

Emphasis

Page 51: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation
• The inputs are frequency values in different criteria.

Titles

Page 52: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation
• Long tails could lead to considering only the maximum values as High (with the original fuzzy sets).
• Low values compressed at the left side.
• We believe that High or Low should be relative values.
• High or Low should depend on the distribution.
• Symmetrical sets are appropriate for uniformly distributed values.
• Input data patterns are not always the same, so the input capture process should not always be the same.

Page 55: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation

Low Medium High

• Grant at least 1 value for each interval: 1/5

• Equidistant percentiles for the rest of the intervals.

Page 57: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation

Low Medium High   Low Medium High

Page 59: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation
• Titles have a small number of possible values.
• We try to establish the sets to allow at least one value in each interval when it is possible.
• We use the lowest value of the distribution for the low set and divide the rest in equidistant percentiles.

Low High
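
The percentile-based adjustment described above can be sketched as follows. This is only one reading of the slides: the lowest observed value anchors the Low set and the remaining breakpoints are placed at equidistant percentiles (nearest-rank) of the other values. The function name, the `n_points` parameter and the way breakpoints would map onto fuzzy set shapes are assumptions.

```python
def tuned_breakpoints(values, n_points=2):
    """Return [lowest value, then n_points equidistant percentile cut points
    computed over the values above the lowest one]."""
    ordered = sorted(values)
    lowest = ordered[0]
    rest = [v for v in ordered if v > lowest]
    points = [lowest]
    if not rest:
        return points
    for i in range(1, n_points + 1):
        # Nearest-rank percentile using integer ceiling division.
        rank = -(-(i * len(rest)) // n_points)
        points.append(rest[rank - 1])
    return points

# Hypothetical long-tailed frequency counts for one criterion.
counts = [1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
breakpoints = tuned_breakpoints(counts, n_points=2)
```

For this toy data the middle breakpoint lands at 3, whereas an equal-width split of the range 1 to 40 would put it near 20, which illustrates why the original symmetrical sets compress low values on a long-tailed distribution.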

Page 60: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation

AFCC Vs. EFCC, FCC & TF-IDF

Dataset      Representation   Avg. F   S. D.
Banksearch   TF-IDF MFT       0.748    0.028
Banksearch   FCC MFT          0.756    0.019
Banksearch   EFCC MFT         0.760    0.014
Banksearch   AFCC MFT         0.770    0.016
WebKB        TF-IDF MFT       0.460    0.051
WebKB        FCC MFT          0.469    0.009
WebKB        EFCC MFT         0.532    0.032
WebKB        AFCC MFT         0.565    0.025
SODP         TF-IDF MFT       0.293    0.030
SODP         FCC MFT          0.242    0.028
SODP         EFCC MFT         0.275    0.024
SODP         AFCC MFT         0.272    0.023

Page 62: An improved fuzzy system for representing web pages in Clustering Tasks

Adjusting the Representation

AFCC Vs. EFCC, FCC & TF-IDF

• AFCC gets the best results in two datasets.
• AFCC always gets results comparable to or better than FCC and EFCC.

Page 63: An improved fuzzy system for representing web pages in Clustering Tasks

Table of Contents

1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
   1. Hierarchical Clustering
   2. Two Languages
5. Conclusions & Outlook
6. Publications

Page 64: An improved fuzzy system for representing web pages in Clustering Tasks

Test Scenario
• To try to build a taxonomy from a set of text documents from Wikipedia.

Page 65: An improved fuzzy system for representing web pages in Clustering Tasks

Test Scenario
• Input: Comparable corpora in English and Spanish (documents about animals).

Page 66: An improved fuzzy system for representing web pages in Clustering Tasks

Test Scenario
• Algorithm: SOM

Page 68: An improved fuzzy system for representing web pages in Clustering Tasks

Test Scenario
• Taxonomic F-measure.
• Labeling process:
  • Infer concept names from the majority of child nodes.
  • When more than one node is selected to be labeled the same, they are merged if siblings…
  • …or the smaller one remains unclassified otherwise.
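
The first labeling step above (taking the majority concept among child nodes) can be sketched in a few lines. The function name and the use of `None` for unlabeled children are hypothetical, and the merging of same-label siblings is not covered here.

```python
from collections import Counter

def majority_label(child_labels):
    """Return the most frequent label among the children, or None if
    no child carries a label. Ties resolve to the first label seen."""
    counts = Counter(label for label in child_labels if label is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy example with made-up concept names.
parent_label = majority_label(["mammal", "bird", "mammal", None])
```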

Page 69: An improved fuzzy system for representing web pages in Clustering Tasks

Test Scenario
• English results:

[Plot: Taxonomic F-measure vs. Dimensions]

Page 70: An improved fuzzy system for representing web pages in Clustering Tasks

Test Scenario
• Spanish results:

[Plot: Taxonomic F-measure vs. Dimensions]

Page 71: An improved fuzzy system for representing web pages in Clustering Tasks

Table of Contents

1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications

Page 72: An improved fuzzy system for representing web pages in Clustering Tasks

Conclusions
• To study a fuzzy model to represent HTML documents for clustering:
  • To propose a lightweight dimension reduction method focused on the weighting function.
  • To propose alternatives to improve the system (EFCC, addFCC).
  • To explore new criteria to be used (IDF, anchor texts).
  • To compare our results with the previous FRBSs and TF-IDF.
• MFT obtained results comparable to LSI when used with EFCC.
• EFCC improved the system by changing the way in which the rules were defined to simplify the system and avoid rules that did not work.
• IDF and Anchor Texts did not contribute to improve results.
• EFCC achieved good performance in all datasets.

Page 73: An improved fuzzy system for representing web pages in Clustering Tasks

Conclusions
• To adjust the system to concrete datasets:
  • To analyze the frequency distributions of terms within each criterion.
  • To propose a way of tuning the basic parameters of the membership functions in an automated way.
  • To evaluate the results compared to previous FRBS and TF-IDF.
• We found different term distributions among datasets: tuning the information capture process seems to make sense.
• Cases that do not follow a power law seem to be better candidates to improve results by FRBS tuning.
• The tuned system is based on dataset statistics only.
• Tuning the system is a feasible way of improving the representation.

Page 74: An improved fuzzy system for representing web pages in Clustering Tasks

Conclusions
• Evaluation of our proposals in a test scenario.
  • Taxonomy learning problem through hierarchical clustering.
  • Different algorithm.
  • Different evaluation method.
  • Comparable corpora written in English and Spanish.
• Fuzzy logic based alternatives improved on TF-IDF in English.
• For Spanish, the results were closer. Probably the stemming process affects the behavior of the representation.
• Our results validate the usefulness of FRBSs for representing documents in clustering tasks.

Page 75: An improved fuzzy system for representing web pages in Clustering Tasks

Conclusions

• Globally in this thesis:
  • Fuzzy logic showed its appropriateness to be used as a tool to declare the knowledge in an easy and understandable way to represent web pages.
  • Some contexts where our proposals could achieve good results have been identified.

Page 76: An improved fuzzy system for representing web pages in Clustering Tasks

Future Directions
• To study the effect of non-linear scaling factors over the fuzzy sets.
• To explore whether partial clustering solutions could be used for tuning the system.
• To study new criteria to include in the combination.
• Would it be possible to learn the rule set from examples?
• To apply this kind of fuzzy approaches to combine profiles in company name filtering on Twitter.

Page 78: An improved fuzzy system for representing web pages in Clustering Tasks

Table of Contents

1. Introduction
2. Web Page Representation and Fuzzy Logic
3. Adjusting the representation
4. Test Scenario: Taxonomy Learning
5. Conclusions & Outlook
6. Publications

Page 79: An improved fuzzy system for representing web pages in Clustering Tasks

Publications
• Peer-reviewed Conferences (I):
  • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2008. Web Page Clustering Using a Fuzzy Logic Based Representation and Self-Organizing Maps. In Proceedings of Web Intelligence 2008, International Conference on Web Intelligence and Intelligent Agent Technology (IEEE/WIC/ACM). Volume 1, Page(s): 851 - 854. Sydney, Australia.
    Acceptance Rate: 20%. [8 citations]
  • Mari-Sanna Paukkeri, Alberto Pérez García-Plaza, Sini Pessala, and Timo Honkela. 2010. Learning taxonomic relations from a set of text documents. In Proceedings of AAIA’10, the 5th International Symposium Advances in Artificial Intelligence and Applications. Page(s): 105 - 112. Wisla, Poland.
    International Fuzzy Systems Association Award for Young Scientist. [2 citations]

Page 80: An improved fuzzy system for representing web pages in Clustering Tasks

Publications
• Peer-reviewed Conferences (II):
  • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2012. Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering. In Proceedings of CICLing 2012, the 13th International Conference on Intelligent Text Processing and Computational Linguistics. Page(s): 157 - 168. New Delhi, India.
    Acceptance Rate: 28.6%
  • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2012. Fitting Document Representation to Specific Datasets by Adjusting Membership Functions. In Proceedings of FUZZ-IEEE 2012, the IEEE International Conference on Fuzzy Systems. Brisbane, Australia.
    ERA A

Page 81: An improved fuzzy system for representing web pages in Clustering Tasks

Publications
• Journals:
  • Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez. 2009. Una Representación Basada en Lógica Borrosa para el Clustering de páginas web con Mapas Auto-Organizativos. Procesamiento del Lenguaje Natural, vol. 42, Pages 79 - 86.
    FECYT Quality Seal for Scientific Spanish Journals. Spanish Foundation for Science and Technology.
  • Mari-Sanna Paukkeri, Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez and Timo Honkela. 2012. Learning a taxonomy from a set of text documents. Applied Soft Computing. Volume 12, Issue 3, Pages 1138 - 1148, March 2012.
    2011 JCR Impact Factor = 2.612. [6 citations]
    Ranked Q1 in Computer Science, Artificial Intelligence and Computer Science, Interdisciplinary Applications.

Page 82: An improved fuzzy system for representing web pages in Clustering Tasks

Publications
• Workshops:
  • Agustín D. Delgado Muñoz, Raquel Martínez, Alberto Pérez García-Plaza and Víctor Fresno. 2012. Unsupervised Real-Time Company Name Disambiguation in Twitter. In Proceedings of the ICWSM-12 Workshop on Real-Time Analysis and Mining of Social Streams, 6th International AAAI Conference on Weblogs and Social Media. Page(s): 25 - 28. Dublin, Ireland.

Page 83: An improved fuzzy system for representing web pages in Clustering Tasks

IF People IS Here AND Talk IS Done THEN Slide IS Thank You!