evidence of quality of textual features on the web 2.0
DESCRIPTION
Evidence of Quality of Textual Features on the Web 2.0. Flavio Figueiredo,Fabiano Belém,Henrique Pinto,Jussara Almeida,Marcos Gonçalves,David Fernandes,Edleno Moura,Marco Cristo ( CIKM 2009 ) Advisor: Dr. Jai-Ling Koh Speaker: Shun-hong Sie Date : 2009/12/21. Outline. Introduction - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/1.jpg)
Evidence of Quality of Textual Features on the Web 2.0Flavio Figueiredo,Fabiano Belém,Henrique Pinto,Jussara Almeida,Marcos Flavio Figueiredo,Fabiano Belém,Henrique Pinto,Jussara Almeida,Marcos
Gonçalves,David Fernandes,Edleno Moura,Marco CristoGonçalves,David Fernandes,Edleno Moura,Marco Cristo((CIKM 2009CIKM 2009))
Advisor: Dr. Jai-Ling KohSpeaker: Shun-hong Sie
Date : 2009/12/21
![Page 2: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/2.jpg)
2
Outline
• Introduction• Related Work• Data collection• Feature Characterization• Object Classification• Conclusions and Future work
![Page 3: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/3.jpg)
3
Introduction
• Web 2.0 applications have grown significantly in diversity and popularity in recent years.
• Social media is here used to refer to the content, most commonly generated by users, available in Web 2.0 applications.
• Being often unsupervised sources of data, social media offers no guarantee of quality, thus posing a challenge to information retrieval (IR) services such as search, recommendation, and online advertising.
![Page 4: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/4.jpg)
4
Introduction
• Tagging has emerged as an effective form of describing and organizing content on the Web 2.0, being a main topic of research in the area.
• The effectiveness of tags as supporting information for IR services is questionable.
• The term quality is here used to refer to the feature’s potential effectiveness as supporting information for IR services.
![Page 5: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/5.jpg)
5
Last.fm
![Page 6: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/6.jpg)
6
Youtube.com
![Page 7: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/7.jpg)
7
Citeulike.org
![Page 8: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/8.jpg)
8
Yahoo video
![Page 9: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/9.jpg)
9
Textual Features on the Web 2.0
• Textual features may be categorized according to the level of user collaboration allowed by the application.
• In particular, the textual features analyzed can be either collaborative(C) or restrictive(R).
• In this paper, we focus on title, tags, comments, description.
![Page 10: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/10.jpg)
10
Main findings• All four features but title are unexplored in a non-
negligible fraction of the collected objects, particularly in applications that allow for the collaborative creation of the feature content;
• The amount of content tends to be larger in collaborative features;
• The typically smaller and more often used title and tags features exhibit higher descriptive and discriminative power, followed by description and comments;
• There is non-negligible diversity of content across features associated with the same object, indicating that features may bring different pieces of information about it.
![Page 11: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/11.jpg)
11
Main findings
• Tags if present, are, in fact, the most promising feature in isolation;
• Combining content from all four features can improve the classification results due to the presence of distinct but somewhat complementary content;
• For classification, feature quality is affected by the amount of content, being title the one with lowest quality, despite its good discriminative/ descriptive power and large object coverage.
![Page 12: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/12.jpg)
12
Related Work
• Comparing tags with other textual features found on Web 2.0 applications is still an open issue.
• The author concludes that tags are not the best textual feature, since title and descriptions better capture the context within the picture and tags are used, mostly, for organizational purposes.
• This result is contradictory to most previous work on tags.
![Page 13: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/13.jpg)
13
Data collection• Follow a snowball sampling strategy.• Each crawler starts with a number of objects as
seeds(at least 800).• In Youtube and YahooVideo, the seeds were randomly
selected from the list of all-time most popular videos provided by the applications.
• In LastFM, they were selected among the artists associated with the most popular tags7.
• For CiteULike, which does not provide explicit related object links but makes daily snapshots of stored objects available for download, we collected a random sample of one snapshot.
![Page 14: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/14.jpg)
14
www.allmusic.com
![Page 15: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/15.jpg)
15
Feature Characterization
![Page 16: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/16.jpg)
16
Feature Characterization
![Page 17: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/17.jpg)
17
Data processing
• Focus on the English language, applying a simple filter that disregards objects with less than three English stop-words in their textual features.
• We also focus our analysis on non-empty feature instances.
![Page 18: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/18.jpg)
18
Data processing
![Page 19: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/19.jpg)
19
Descriptive and Discriminative Power
• Web 2.0 features normally define different content blocks in Web pages.
• For instance, in Youtube, the comments about a video are placed near to each other into a common (comments) block.
![Page 20: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/20.jpg)
20
Descriptive Power
• Average Feature Spread a heuristic metric.• Let o be an object in the collection O, and t a
term which appears in at least one feature instance f associated with o. The term spread, TS(t, o), measures the number of feature instances associated with o which contain t.
![Page 21: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/21.jpg)
Average Feature Spread
21
Feature
Collection(Document 、
web pages)
o
Collection(Document 、
web pages)Collection
(Document 、web pages)
Collection(Document 、
web pages)
Collection(Document 、
web pages)Collection(Document 、
web pages)Collection(Document 、
web pages)Collection
(Document 、web pages)
Collection(Document 、
web pages)
Term
![Page 22: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/22.jpg)
22
Descriptive Power
• Feature Instance Spread of a feature instance f associated with an object o, FIS(f, o), as the average term spread across all terms in f.
• The descriptive power of a feature F in the object collection O can be captured by averaging the values of FIS across all objects- Average Feature Spread of F, AFS(F).
![Page 23: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/23.jpg)
23
Discriminative Power
• Use a heuristic called Average Inverse Feature Frequency (AIFF), builds on a small variation of the IDF metric, called Inverse Feature Frequency (IFF).
|VF| the size of the complete vocabulary of feature F
![Page 24: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/24.jpg)
24
Discriminative Power
• The values of AFS and AIFF are used to estimate the Feature Importance (FI), which measures how good a feature is considering both its discriminative and descriptive power.
FI(F) = AFS(F) × AIFF(F)
![Page 25: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/25.jpg)
25
Descriptive and Discriminative Power
![Page 26: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/26.jpg)
26
Descriptive and Discriminative Power
![Page 27: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/27.jpg)
27
Content Diversity• We quantify the similarity between two
feature instances in terms of term co-occurrence using the Jaccard Coefficient.
• Given two sets of items T1 and T2, the Jaccard Coefficient is computed.
![Page 28: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/28.jpg)
28
Content Diversity
![Page 29: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/29.jpg)
29
Summary• All four features but title, have a non-negligible fraction of
empty instances in at least two of the analyzed applications, and thus may not be effective as single source of information for IR services.
• The amount of content tends to be larger in collaborative features.
• The typically smaller and more often used title and tags exhibit higher descriptive and discriminative power, according to our AFS and AIFF heuristics, followed by description and comments.
• There is significant content diversity across features associated with the same object, indicating that each feature may bring different pieces of information about it.
![Page 30: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/30.jpg)
30
Object Classification• Adopt the Vector Space Model.• TS : the weight of a term t in an object o is equal to the spread of t in o, TS(t, o), which heuristically
captures the descriptive power of the term.• TS × IFF : the weight of t is equal to the product TS(t, o)× IFF(t, F), capturing, thus, both descriptive
and discriminative powers of t. • TF : the weight of t is given by the frequency of t in the feature (combination) used to represent the
object.• TF × IFF : the weight is given by the product TF ×IFF.• Title : only terms present in the title of o constitute the vector V , that is, V = Vtitle.
• Tags : V = Vtags, where Vtags is the vector that represents only terms in the tags of o.
• Description : V = Vdesc., where Vdesc. is the vector that represents only terms in the description of o.
• Comments : V = Vcomm, where Vcomm is the vector that represents only terms in the comments of o.
• Bag of Words : all four features are taken as a single document, as done by traditional IR algorithms, which do not consider the block structure in a Web page.
• Concatenation All textual features are concatenated in a single vector, so that the same terms in different features are considered different elements in the vector.
![Page 31: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/31.jpg)
31
Experimental Setup
• Support Vector Machine.
• K-Folder cross-validation.(K=10)
• Classification evaluation metrics: Macro-F1 and Micro-F1.
![Page 32: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/32.jpg)
32
Answer
ResultTrue False
True TP FP
False FN TN
Macro-F1 is defined as the average of the F1 values over all classes, whereas Micro-F1 is computed using values of TP, FP and FN calculated as the sum of respective values for all classes.
FPTP
TPprecision
FNTP
TPrecall
recallprecision
recallprecisionf
2
1
![Page 33: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/33.jpg)
33
Experimental Setup
![Page 34: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/34.jpg)
34
Experimental Result
![Page 35: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/35.jpg)
35
Experimental Result
![Page 36: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/36.jpg)
36
Experimental Summary• Tags, when present, is the best feature in isolation for classification
purposes due to a combination of good descriptive and discriminative power and large amount of content;
• Combination of features may bring some benefits due to the presence of distinct but somewhat complementary content;
• Terms in different feature instances should be considered in their original context in order to preserve their descriptive/discriminative power;
• A combination of fewer classes, with less ambiguity, and more qualified human labelers, made classification more effective in LastFM than in the video applications;
• The use of different weighting schemes does not produce much difference due to inherent properties of the SVM classifier.
![Page 37: Evidence of Quality of Textual Features on the Web 2.0](https://reader033.vdocument.in/reader033/viewer/2022051622/56814f7e550346895dbd2fba/html5/thumbnails/37.jpg)
37
Conclusions
• Collaborative features, including tags, are absent in a non-negligible fraction of objects.
• Collaborative features tend to provide more content than restrictive features, such as title.
• Smaller title and tags have higher descriptive and discriminative power, according to our adapted heuristics, and that there exists significant content diversity across features.
• Combining content from all features can lead to classification improvements.