Classifying Tags Using Open Content Resources
Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol, WSDM '09
![Page 1: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/1.jpg)
Classifying Tags Using Open Content Resources
Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol
WSDM '09
![Page 2: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/2.jpg)
Motivation
• Classify tags in Flickr as broad categories such as what, where, when and who
• Easier indexing and navigation
• WordNet is usually used for classification but has limited coverage
![Page 3: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/3.jpg)
Example
![Page 4: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/4.jpg)
The ClassTag System
![Page 5: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/5.jpg)
Classifying Wikipedia Articles
• Using only metadata (i.e. Categories and Templates) – high scalability
• Supervised Classifier
– Articles as objects
– WordNet noun semantic categories as classification classes
– Categories and Templates as features
– Support Vector Machine (SVM) as classifier
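The "categories and templates as features" setup can be illustrated with a minimal sketch. It only shows how an article's metadata might be turned into a sparse feature vector that an SVM could consume; the toy articles, the `cat:`/`tpl:` prefixes and all function names are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

# Hypothetical toy data: article title -> its categories and templates.
ARTICLES = {
    "London": {"categories": ["Capitals in Europe"],
               "templates": ["Infobox settlement"]},
    "Paris": {"categories": ["Capitals in Europe"],
              "templates": ["Infobox settlement"]},
    "Albert Einstein": {"categories": ["German physicists"],
                        "templates": ["Infobox scientist"]},
}


def extract_features(article):
    """Return a bag of metadata features, prefixed by feature type."""
    bag = Counter()
    for category in article["categories"]:
        bag["cat:" + category] += 1
    for template in article["templates"]:
        bag["tpl:" + template] += 1
    return dict(bag)


def build_vocabulary(bags):
    """Assign a stable integer index to every feature name seen."""
    names = sorted({name for bag in bags for name in bag})
    return {name: index for index, name in enumerate(names)}


def to_sparse_vector(bag, vocabulary):
    """Map a feature bag to {feature index: count}, ignoring unseen names."""
    return {vocabulary[name]: count
            for name, count in bag.items() if name in vocabulary}


BAGS = {title: extract_features(a) for title, a in ARTICLES.items()}
VOCAB = build_vocabulary(BAGS.values())
VECTORS = {title: to_sparse_vector(bag, VOCAB) for title, bag in BAGS.items()}
```

In this toy data "London" and "Paris" share every feature, so they end up with identical vectors, which is the kind of regularity a supervised classifier can exploit.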
![Page 6: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/6.jpg)
Categories and Templates
![Page 7: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/7.jpg)
Categories and Templates
![Page 8: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/8.jpg)
Supervised Classification
• Ground Truth
– All Wikipedia articles that match WordNet nouns
• Data Sparsity
– WordNet categories under-represented (10 out of 25)
– Articles have very few features
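One way to picture the ground-truth step is the sketch below. It assumes a lexicon that maps each noun to a single WordNet noun semantic category and a simple case-insensitive title match; both are simplifications made for illustration, and the lexicon excerpt is hypothetical.

```python
from collections import Counter

# Hypothetical lexicon excerpt: noun -> WordNet noun semantic category.
LEXICON = {
    "london": "noun.location",
    "dog": "noun.animal",
    "monday": "noun.time",
}


def build_ground_truth(article_titles, lexicon):
    """Label every article whose title matches a lexicon noun."""
    truth = {}
    for title in article_titles:
        key = title.lower()  # assumption: case-insensitive exact match
        if key in lexicon:
            truth[title] = lexicon[key]
    return truth


def class_counts(truth):
    """Count labelled articles per class, to spot under-represented ones."""
    return dict(Counter(truth.values()))


TRUTH = build_ground_truth(["London", "Dog", "Flickr"], LEXICON)
```

Articles without a matching noun (here "Flickr") stay unlabelled; those are the ones the trained classifier is later asked to categorise.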
![Page 9: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/9.jpg)
Reducing Data Sparsity
• Using category and template network transclusion
• … but noise is added
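The idea of gathering extra features by walking the category network can be sketched as a bounded breadth-first traversal. The toy network, the convention that directly assigned categories sit at layer 1, and the function name are assumptions for illustration.

```python
from collections import deque

# Hypothetical category network: category -> parent categories.
PARENTS = {
    "Capitals in Europe": ["Capitals", "Cities in Europe"],
    "Capitals": ["Cities"],
    "Cities in Europe": ["Cities", "Europe"],
    "Cities": ["Populated places"],
}


def expand_features(direct, parents, max_arcs):
    """Collect categories reachable within max_arcs arcs.

    Returns {category: layer}, where layer 1 holds the categories
    assigned directly to the article and each arc adds one layer.
    """
    layers = {category: 1 for category in direct}
    queue = deque(direct)
    while queue:
        node = queue.popleft()
        layer = layers[node]
        if layer - 1 >= max_arcs:  # arc budget used up on this path
            continue
        for parent in parents.get(node, []):
            if parent not in layers:  # keep the shallowest layer only
                layers[parent] = layer + 1
                queue.append(parent)
    return layers
```

Each additional arc adds more, and more general, categories: the article gets more features, but broad ancestors such as "Populated places" say less about it, which is where the noise comes from.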
![Page 10: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/10.jpg)
System Optimization
• Number of arcs traversed in
– Category network
– Template network
• Choice of weighting function
– Term Frequency (tf)
– Term Frequency – Inverse Document Frequency (tf-idf)
– Term Frequency – Inverse Layer (tf-il)
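The three weighting functions can be compared with a small sketch. The slides do not give formulas, so the definitions here are assumptions: tf counts how often a feature is reached, tf-idf scales that count by log(N / document frequency), and tf-il lets each occurrence contribute 1 / layer, so features found further out in the network weigh less.

```python
import math
from collections import Counter


def tf(occurrences):
    """Term frequency: how many times each feature was reached."""
    return dict(Counter(name for name, _layer in occurrences))


def tf_idf(occurrences, doc_freq, n_docs):
    """tf scaled by log(n_docs / document frequency) of the feature."""
    return {name: count * math.log(n_docs / doc_freq[name])
            for name, count in tf(occurrences).items()}


def tf_il(occurrences):
    """Assumed tf-il: every occurrence contributes 1 / layer."""
    weights = Counter()
    for name, layer in occurrences:
        weights[name] += 1.0 / layer
    return dict(weights)


# Hypothetical occurrences as (feature, layer) pairs for one article.
OCCURRENCES = [("Cities", 2), ("Cities", 3), ("Capitals", 1)]
```

Under these assumed definitions, "Cities" has the highest tf (2) but a lower tf-il weight than the directly assigned "Capitals", because both of its occurrences lie further from the article.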
![Page 11: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/11.jpg)
Example
![Page 12: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/12.jpg)
Fine Tuning
• Partitioned the ground truth into training and test sets
• Criteria
– At least 80% precision
– Maximum possible recall
• Resulting optimal values
– Category arcs: 3, Template arcs: 3, TF-IL
– Precision: 87%
– F1-Measure: 0.696
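The selection criteria amount to a constrained choice: among all configurations that reach the precision floor, keep the one with the highest recall. A minimal sketch follows; the candidate configurations and their scores are made-up numbers, not results from the talk.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pick_configuration(results, min_precision=0.80):
    """Return the highest-recall result meeting the precision floor."""
    eligible = [r for r in results if r["precision"] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["recall"])


# Hypothetical evaluation results for three parameter settings.
RESULTS = [
    {"name": "A", "precision": 0.90, "recall": 0.40},
    {"name": "B", "precision": 0.82, "recall": 0.55},
    {"name": "C", "precision": 0.70, "recall": 0.80},
]
```

Here setting "C" has the best recall but misses the 80% precision floor, so "B" would be selected.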
![Page 13: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/13.jpg)
SVM Threshold
• SVM outputs the confidence with which an article is correctly classified as a member of a category
• Training experiment with 250 Wikipedia articles (1 assessor)
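The role of the threshold can be sketched as follows: only classifications whose confidence reaches the threshold are kept, and the assessed sample shows how precision and the share of classified articles move as the threshold changes. The judgement data and the function name are hypothetical.

```python
def evaluate_threshold(judgements, threshold):
    """Return (share of articles classified, precision) at a threshold.

    judgements is a list of (confidence, is_correct) pairs taken from a
    manually assessed sample. Precision is None when nothing is accepted.
    """
    accepted = [ok for confidence, ok in judgements if confidence >= threshold]
    coverage = len(accepted) / len(judgements)
    if not accepted:
        return coverage, None
    precision = sum(1 for ok in accepted if ok) / len(accepted)
    return coverage, precision


# Hypothetical assessed sample: (SVM confidence, assessor says correct).
JUDGEMENTS = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
```

Raising the threshold from 0.5 to 0.7 in this toy sample lifts precision from 2/3 to 1.0 while the share of classified articles drops from 60% to 40%, the same trade-off that separates a recall-oriented from a precision-oriented setting.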
![Page 14: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/14.jpg)
SVM Threshold
![Page 15: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/15.jpg)
SVM Threshold
![Page 16: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/16.jpg)
Summary
• Optimised for Recall (ClassTag)
– 39% of articles classified
– 664,770 Wikipedia articles
• Optimised for Precision (ClassTag+)
– 21% of articles classified
– 338,061 Wikipedia articles
![Page 17: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/17.jpg)
Comparison with DBpedia
• Experimental Setup
– 300 pooled articles
– 3 Assessors
– Blind Assessments
– 50 articles overlap
• Partial Agreement: 86%
• Total Agreement: 78%
![Page 18: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/18.jpg)
Results
![Page 19: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/19.jpg)
Classification of Flickr Tags
• Tag -> Anchor Text
– String matching
• Anchor Text -> Wikipedia Article
– Number of times an anchor refers to a Wikipedia article
• Wikipedia Article -> Category
– Output of SVM decision
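The three mapping steps can be chained in a short sketch. It assumes that string matching ignores case and white space, that the most frequently linked article wins, and that article categories are available as a precomputed lookup standing in for the SVM decision; all counts and names are hypothetical.

```python
# Hypothetical data: anchor text -> {article: number of links using it}.
ANCHOR_COUNTS = {
    "george bush": {"George W. Bush": 120, "George Bush Senior": 45},
    "london": {"London": 900},
}

# Hypothetical stand-in for the SVM output: article -> category.
ARTICLE_CATEGORY = {
    "George W. Bush": "noun.person",
    "George Bush Senior": "noun.person",
    "London": "noun.location",
}


def normalise(text):
    """Lower-case and drop all white space (assumed matching rule)."""
    return "".join(text.lower().split())


def classify_tag(tag, anchor_counts, article_category):
    """Tag -> anchor text -> most linked article -> category."""
    anchors = {normalise(anchor): anchor for anchor in anchor_counts}
    anchor = anchors.get(normalise(tag))
    if anchor is None:
        return None  # the tag matches no anchor text
    targets = anchor_counts[anchor]
    article = max(targets, key=targets.get)
    return article_category.get(article)
```

With this data the tag "georgebush" resolves through the anchor "george bush" to the article "George W. Bush" and from there to a person category.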
![Page 20: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/20.jpg)
Ambiguity
• Tag -> Anchor Text
– Some ambiguity because tags are often lower case with no white spaces
• Anchor Text -> Wikipedia Article
– 13.4% of Anchor text -> Wikipedia Article mappings ambiguous
– 4% of Anchor text -> Category mappings ambiguous
– Example: George Bush -> George W. Bush, George Bush Senior; George Bush -> Person
• Wikipedia Article -> Category
– 5.7% of classified articles result in multiple classifications
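The gap between the two ambiguity figures can be made concrete with a sketch that counts, per anchor text, how many distinct articles and how many distinct categories it can lead to. The data and the exact definition of "ambiguous" (more than one distinct target) are assumptions for illustration.

```python
def ambiguity_rates(anchor_targets, article_category):
    """Return (share of anchors with several articles,
               share of anchors with several categories)."""
    total = len(anchor_targets)
    if total == 0:
        return 0.0, 0.0
    multi_article = 0
    multi_category = 0
    for articles in anchor_targets.values():
        if len(set(articles)) > 1:
            multi_article += 1
        categories = {article_category[a] for a in articles
                      if a in article_category}
        if len(categories) > 1:
            multi_category += 1
    return multi_article / total, multi_category / total


# Hypothetical data: anchor text -> articles it links to.
ANCHOR_TARGETS = {
    "george bush": ["George W. Bush", "George Bush Senior"],
    "java": ["Java", "Java (programming language)"],
    "london": ["London"],
    "dog": ["Dog"],
}
CATEGORY_OF = {
    "George W. Bush": "noun.person",
    "George Bush Senior": "noun.person",
    "Java": "noun.location",
    "Java (programming language)": "noun.communication",
    "London": "noun.location",
    "Dog": "noun.animal",
}
```

"george bush" is ambiguous at the article level but not at the category level, since both candidates are people; that is why category ambiguity is lower than article ambiguity.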
![Page 21: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/21.jpg)
Example
![Page 22: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/22.jpg)
Evaluation
• Vocabulary coverage of the WordNet classification extended by 115%
• Taking tag frequency into account
– ClassTag classified 69.2% of Flickr tags
– 22% more than the WordNet baseline
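The difference between vocabulary coverage and coverage "taking tag frequency into account" can be sketched with two small functions. The tag counts below are invented for illustration.

```python
def vocabulary_coverage(tag_counts, classified):
    """Share of distinct tags that received a category."""
    return sum(1 for tag in tag_counts if tag in classified) / len(tag_counts)


def weighted_coverage(tag_counts, classified):
    """Share of tag occurrences that received a category."""
    total = sum(tag_counts.values())
    covered = sum(count for tag, count in tag_counts.items()
                  if tag in classified)
    return covered / total


# Hypothetical tag frequencies and set of classified tags.
TAG_COUNTS = {"london": 50, "wedding": 30, "dsc0001": 15, "xyzzy": 5}
CLASSIFIED = {"london", "wedding"}
```

Half of the distinct tags are classified here, but because the classified tags are the frequent ones, 80% of all tag occurrences are covered.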
![Page 23: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/23.jpg)
Tag distribution
![Page 24: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/24.jpg)
Multilanguage Classification
• 80% of tags in English, 7% in German and 6% in Dutch
• A portion of the unclassified tags may fall into the non-English group
• Possible alternate language classification
– Run ClassTag using an alternate-language Wikipedia and a corresponding lexicon
– Translate the English classification using Wikipedia's interlanguage links
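The second option, carrying the English classification over through interlanguage links, can be sketched as a simple lookup. The link table and function name are hypothetical; a real system would read the links from a Wikipedia dump or API.

```python
def translate_classification(english_classes, interlanguage_links, lang):
    """Project {English title: category} onto titles in another language.

    interlanguage_links maps an English title to {language code: title}.
    Articles without a link in the requested language are skipped.
    """
    translated = {}
    for title, category in english_classes.items():
        foreign_title = interlanguage_links.get(title, {}).get(lang)
        if foreign_title is not None:
            translated[foreign_title] = category
    return translated


# Hypothetical classification and interlanguage link table.
ENGLISH_CLASSES = {"London": "noun.location", "Dog": "noun.animal"}
LINKS = {
    "London": {"de": "London", "nl": "Londen"},
    "Dog": {"de": "Haushund"},
}
```

Asking for Dutch (`"nl"`) yields a category for "Londen" only, since the toy table has no Dutch link for "Dog"; coverage in the target language is bounded by the available links.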
![Page 25: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/25.jpg)
Contributions
• Classifying open content resources using their structural patterns
• Presenting ClassTag, a system for classifying tags
• ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia
![Page 26: Classifying Tags Using Open Content Resources](https://reader034.vdocument.in/reader034/viewer/2022051518/56816768550346895ddc4dfb/html5/thumbnails/26.jpg)
Conclusion
• Tuneable system for classifying Wikipedia pages
– ClassTag: nearly 40% of articles classified with a precision of 72%
– ClassTag+: 21% of articles classified with a precision of 86% (equal to assessor agreement)
• Nearly 70% of Flickr tags matched to WordNet categories