A Comparison of Implicit and Explicit Links
for Web Page Classification
Dou Shen (1), Jian-Tao Sun (2), Qiang Yang (1), Zheng Chen (2)
(1) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong
(2) Microsoft Research Asia, China
Outline
  Introduction
  Related Work
  Implicit and Explicit Links
  Links for Classification
  Experiments
  Conclusion and Future Work
Introduction
  Why do we need Web page classification?
    Organize the growing number of pages
    Facilitate other text mining applications
  How can Web pages be classified?
    Classification algorithm (SVM, NB, KNN, …)
    Web page representation
Introduction (Cont.)
  Web page representation
    Content based
      Utilize the words or phrases of the target page
      However, very often a Web page does not contain enough textual clues
    Context based
      Leverage hyperlinks to connect pages
      It works. However, hyperlinks sometimes do not reflect the true content relationships between Web pages
  Can other kinds of links be defined and used?
  How can classification be improved with the new links?
Related Work
  Exploiting hyperlinks
    Chakrabarti et al. used the predicted labels of neighboring documents to reinforce classification decisions for a given document;
    Fürnkranz also reported a significant improvement in classification accuracy when using the link-based method as opposed to the full text alone.
  Exploiting query logs
    Beeferman and Berger proposed an innovative query clustering method based on query logs;
    Xue et al. proposed a novel categorization algorithm named IRC to categorize interrelated Web objects by leveraging query logs.
Implicit and Explicit Links
  [Figure: query log example. Person 1 issues the query “SIGIR”; the clicked pages are marked with √.]
Implicit and Explicit Links (Cont.)
  Implicit link 1 (LI1)
    Assumption: a user tends to click pages related to the issued query;
    Definition: there is an LI1 between d1 and d2 if they are clicked by the same person through the same query.
  Implicit link 2 (LI2)
    Assumption: users tend to click related pages for the same query;
    Definition: there is an LI2 between d1 and d2 if they are clicked for the same query, regardless of which user clicked them.
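The two definitions can be sketched in Python as follows. The record layout `(user, query, page)` and the function name `build_implicit_links` are assumptions made for illustration; the slides do not describe the actual log format or any preprocessing.

```python
from collections import defaultdict
from itertools import combinations


def build_implicit_links(records):
    """Return (LI1, LI2) as sets of unordered page pairs.

    records: iterable of (user, query, page) click tuples (hypothetical layout).
    """
    by_user_query = defaultdict(set)  # pages clicked by one user for one query
    by_query = defaultdict(set)       # pages clicked by any user for one query
    for user, query, page in records:
        by_user_query[(user, query)].add(page)
        by_query[query].add(page)

    def pairs(groups):
        # Every pair of distinct pages in the same group becomes one link.
        links = set()
        for pages in groups.values():
            links.update(combinations(sorted(pages), 2))
        return links

    return pairs(by_user_query), pairs(by_query)


log = [
    ("u1", "sigir", "p1"),
    ("u1", "sigir", "p2"),
    ("u2", "sigir", "p3"),
]
li1, li2 = build_implicit_links(log)
```

In this toy log, LI1 contains only the pair (p1, p2), while LI2 also links p3 to both pages because all three were clicked for the same query.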
Implicit and Explicit Links (Cont.)
  Comparison between LI1 and LI2
    The constraint for LI2 is not as strict as that for LI1;
    Thus, more LI2 links than LI1 links can be constructed;
    LI2 is noisier than LI1, especially for ambiguous queries (such as “apple”).
Implicit and Explicit Links (Cont.)
  Three kinds of explicit links, defined based on hyperlinks
    CondE1: there exist hyperlinks from dj to di (in-link to di from dj)
    CondE2: there exist hyperlinks from di to dj (out-link from di to dj)
    CondE3: either CondE1 or CondE2 holds
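A minimal sketch of the three conditions, using the small graph A→B, B→C, C→B that appears later in the link statistics. The function name and the counting convention (ordered pairs for LE1 and LE2, unordered pairs for LE3) are assumptions chosen so that the counts match the slide.

```python
def build_explicit_links(hyperlinks):
    """Derive LE1, LE2 and LE3 from (source, target) hyperlink pairs."""
    edges = {(s, t) for s, t in hyperlinks if s != t}
    le1 = {(t, s) for s, t in edges}      # (di, dj): dj links to di (in-link)
    le2 = set(edges)                      # (di, dj): di links to dj (out-link)
    le3 = {frozenset(e) for e in edges}   # unordered pair: either direction
    return le1, le2, le3


le1, le2, le3 = build_explicit_links([("A", "B"), ("B", "C"), ("C", "B")])
counts = (len(le1), len(le2), len(le3))
```

For this graph `counts` is `(3, 3, 2)`: B→C and C→B collapse into a single LE3 link.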
Links for Classification
  Classification by Linking Neighbors (CLN)
    CLN is similar to KNN;
    K is not a constant as in KNN; it is determined by the set of neighbors of the target page.
  [Figure: test data and training data, with pages d1 and d2]
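The idea can be sketched as a vote among the labeled pages linked to the target page. The unweighted majority vote, the tie-breaking rule, and all names below are assumptions for illustration; the slides only state that CLN resembles KNN with K set by the neighbor set.

```python
from collections import Counter


def classify_by_linking_neighbors(page, links, labels):
    """Predict a class for `page` from its labeled linked neighbors.

    links: set of frozenset({a, b}) page pairs; labels: dict page -> class.
    Returns None when the page has no labeled neighbor.
    """
    votes = Counter()
    for link in links:
        if page in link and len(link) == 2:
            (other,) = link - {page}
            if other in labels:
                votes[labels[other]] += 1
    if not votes:
        return None
    # Highest vote count wins; ties are broken alphabetically (an assumption).
    return min(votes, key=lambda c: (-votes[c], c))


links = {frozenset({"t", "d1"}), frozenset({"t", "d2"}), frozenset({"t", "d3"})}
labels = {"d1": "Sports", "d2": "Sports", "d3": "Arts"}
predicted = classify_by_linking_neighbors("t", links, labels)
```

Here K is 3 simply because the test page `t` has three linked training pages, and the prediction is `Sports`.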
Links for Classification (Cont.)
  Build virtual document
    Given a document, the virtual document is constructed by borrowing some extra text from its neighbors
    Extra text
      Local Text: plain text + meta data
      Anchor Text
      Extended Anchor Text
      Anchor Sentence
    Apply any classifier, such as SVM or NB
Links for Classification (Cont.)
  Local Text
    Plain text: the text remaining after removing HTML tags;
    Meta data: the text between <Meta> and </Meta>;
  Anchor Text
    The visible text in a hyperlink
  Extended Anchor Text
    The set of rendered words occurring up to 25 words before and after an associated link
  Anchor Sentence
    The set of sentences containing the query on which the implicit link is based
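Two of these definitions can be sketched directly. The regex sentence splitter, the whitespace tokenization, and the choice to exclude the anchor words from the extended anchor text are simplifying assumptions; the slides do not say how sentences or words were segmented.

```python
import re


def anchor_sentences(text, query):
    """Return the sentences of `text` that contain `query` (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    q = query.lower()
    return [s for s in sentences if q in s.lower()]


def extended_anchor_text(words, start, end, window=25):
    """Return up to `window` words before and after the anchor words[start:end]."""
    before = words[max(0, start - window):start]
    after = words[end:end + window]
    return before + after


text = "SIGIR is a conference. It is held yearly. Submit to SIGIR now!"
found = anchor_sentences(text, "sigir")

words = [str(i) for i in range(100)]
context = extended_anchor_text(words, start=50, end=52)
```

With the default window, `context` holds the 25 words before position 50 and the 25 words after position 51.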
Experiments
  Datasets
    1.3 million Web pages in 424 classes from the Open Directory Project (ODP)
    44.7 million query log records collected over 29 days from MSN
  Classifiers
    Naïve Bayesian classifier; Support Vector Machine (SVMlight)
  Evaluation metrics
    Precision, Recall, F1
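The result charts report Micro-F1 and Macro-F1. A small sketch of both for single-label predictions follows; averaging the macro score over every class seen in either the true or the predicted labels is a convention assumed here, not something the slides specify.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        # F1 = 2PR / (P + R), rewritten in terms of raw counts.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    classes = set(y_true) | set(y_pred)
    if not classes:
        return 0.0, 0.0
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


micro, macro = micro_macro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Micro-F1 pools the counts over all classes, so large classes dominate; Macro-F1 averages the per-class scores, so small classes count equally.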
Experiments (Cont.)
  Statistics of links
    Consistency: the percentage of links whose two linked pages belong to the same category.
    The consistency of LI1 is much higher than that of the other links;
    The consistency values of all explicit links are lower than 50%, which explains published findings that using hyperlinks in a straightforward way is not helpful;
    #LE1 = #LE2 > #LE3. Example: for hyperlinks A→B, B→C, C→B: #LE1 = 3; #LE2 = 3; #LE3 = 2
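The consistency measure can be sketched as below. The function name and the decision to skip links with an unlabeled endpoint are assumptions; the slides do not say how such links were handled.

```python
def link_consistency(links, labels):
    """Fraction of links whose two pages share a category.

    links: iterable of (page_a, page_b); labels: dict page -> category.
    Links with an unlabeled endpoint are skipped (an assumption).
    """
    same = total = 0
    for a, b in links:
        if a in labels and b in labels:
            total += 1
            same += labels[a] == labels[b]
    return same / total if total else 0.0


links = [("p1", "p2"), ("p1", "p3"), ("p2", "p4")]
labels = {"p1": "X", "p2": "X", "p3": "Y"}
consistency = link_consistency(links, labels)
```

In this toy example one of the two fully labeled links joins pages of the same category, so the consistency is 0.5.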
Experiments (Cont.)
The results are consistent with the consistency values of different kinds of links
Compare the best result of implicit links and the best result of explicit links
[Figure: Results of CLN on Different Links. Bar chart of Micro-F1 and Macro-F1 (axis 0 to 0.6) for LI1, LI2, LE1, LE2, and LE3; annotated values: 20.6% and 44.0%.]
Experiments (Cont.)
  Construction of virtual documents
    Explicit links (dm and dn are connected to dk through LE1)
      ELT: VD(dk) = LT(dm) + LT(dn)
      AT:  VD(dk) = AT(dm) + AT(dn)
      EAT: VD(dk) = EAT(dm) + EAT(dn)
    Implicit links (dm and dn are connected to dk through LI1)
      ILT: VD(dk) = LT(dm) + LT(dn)
      AS:  VD(dk) = AS(dm) + AS(dn)
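Each construction has the same shape: the virtual document of dk is the sum of one kind of extra text taken from its neighbors dm and dn. A sketch, with hypothetical names and with plain string concatenation standing in for "+":

```python
def build_virtual_document(page, neighbors, extra_text):
    """Concatenate the extra text contributed by each neighbor of `page`.

    neighbors: dict page -> list of linked pages (for example LE1 or LI1 neighbors).
    extra_text: dict neighbor -> text (the LT, AT, EAT, or AS of that neighbor).
    Neighbors without extra text are skipped.
    """
    parts = [extra_text[n] for n in neighbors.get(page, []) if n in extra_text]
    return " ".join(parts)


neighbors = {"dk": ["dm", "dn"]}
local_text = {"dm": "text of dm", "dn": "text of dn"}
vd = build_virtual_document("dk", neighbors, local_text)
```

Passing LI1 neighbors with local text gives an ILT-style document, and passing LE1 neighbors gives an ELT-style one; only the inputs change.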
Experiments (Cont.)
  Performance on different kinds of VD
    The performance of AS, EAT, and AT is only as good as the baseline, or even worse;
    ILT is much better than ELT;
    ELT is better than LT, but not always.
Experiments (Cont.)
  Explanation
    The average size of the virtual documents (in KB)
    The consistency or purity of the content of the virtual documents
Experiments (Cont.)
  Effect of different combinations
Experiments (Cont.)
  Observations
    Any of AT, EAT, or AS can improve classification performance;
    AS achieves the greatest improvement;
    Different weighting schemes do not make much of a difference;
    We also tried combining LT, EAT, and AS together, but obtained no further improvement.
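The slides do not say which weighting schemes were tried. One plausible reading, sketched here purely as an assumption, is to merge the term counts of the local text with down-weighted or up-weighted term counts of the extra text before training a classifier. The function name and the `extra_weight` parameter are hypothetical.

```python
from collections import Counter


def combine_term_counts(local_text, extra_text, extra_weight=1.0):
    """Merge term counts of local text with weighted term counts of extra text."""
    combined = Counter()
    for term in local_text.lower().split():
        combined[term] += 1.0
    for term in extra_text.lower().split():
        combined[term] += extra_weight  # hypothetical weighting scheme
    return dict(combined)


features = combine_term_counts("web page", "page ranking", extra_weight=0.5)
```

Varying `extra_weight` is one simple way to explore how strongly the borrowed text should influence the representation.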
Experiments (Cont.)
  The effect of query log quantity
  [Figure: Micro-F1 and Macro-F1 of NB and SVM (axis 0.10 to 0.70) as the share of the query log grows from 0% to 100%.]
Conclusion
  Based on query logs, a new kind of link, the implicit link, is introduced;
  A comparison between implicit and explicit links on a large dataset is given;
  The concept of a virtual document built by extracting “anchor sentences (AS)” through implicit links is presented;
  Experimental results show that implicit links are better than explicit links when used for Web page classification.
Future Work
Introduce more kinds of implicit and explicit links;
Try on more applications such as clustering and summarization;
Extract other information such as “Dissimilarity Relationship” from query log.
Thanks