Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata
Anon Plangprasopchok (1), Kristina Lerman (1), Lise Getoor (2)
(1) USC Information Sciences Institute; (2) Department of Computer Science, University of Maryland
SIGKDD 2010
2010. 12. 17.
Summarized and Presented by Kim Chung Rim, IDS Lab., Seoul National University
Copyright 2010 by CEBT
Contents
Introduction
Goal
Challenges
Learning Folksonomies
SAP: Growing a Tree
Evaluation
Conclusion & Discussion
Introduction
The social Web has changed the way people create and use information
Users organize their content via tags
Sites such as Flickr, Del.icio.us and YouTube allow users to tag their content with descriptive keywords
Social Metadata
Captures the collective knowledge of Social Web users
Once mined from many users, such collective knowledge adds a rich semantic layer to Social Web content
Could be used for information discovery
Structured Social Metadata
Some Social Websites allow users to organize tags hierarchically
Delicious
Users can group related tags into bundles
Flickr
Users can group related photos into sets and related sets into collections
Folksonomy
Folksonomy: a common taxonomy that is learned from social metadata
Predefined hierarchies already exist, such as WordNet or the Linnaean classification system
Benefits of folksonomies over existing hierarchies
Represent the collective agreement of many individuals
Are relatively inexpensive to obtain
Can adapt to evolving vocabularies and the community's information needs
Are directly tied to the annotated content
Goal
The goal of this paper is to
Extract users' knowledge (folk knowledge) from user-generated semantics
Build a communal taxonomy tree from it
[Figure: three personal hierarchies, animal->{fish, cat, dog}, dog->{beagle, jindo} and animal->{reptile, cat}, are combined into one tree: animal->{fish, cat, dog->{beagle, jindo}, reptile}]
Challenges
Sparseness
Social metadata is very sparse – only 3.74 tags per photo
Noisy vocabulary
Spelling errors and meaningless tags, e.g., not sure, mykid
Ambiguity
An individual tag is often ambiguous, e.g., jaguar
Challenges
Structural noise and conflicts
Users use different terms to build their own hierarchies
Varying granularity level
Users’ level of expertise and expressiveness differ
Learning Folksonomies - intuition
Roots with similar terms should be aggregated horizontally
A root and a leaf with similar terms should be aggregated vertically
[Figure: horizontal aggregation merges animal->{fish, cat, dog} and animal->{reptile, cat} into animal->{fish, cat, dog, reptile}; vertical aggregation attaches dog->{beagle, jindo} under the dog leaf of animal->{fish, cat, dog}]
Learning Folksonomy
Relational Clustering
Two nodes should be clustered if their similarity is above a certain threshold
Similarity is computed using local similarity & structural similarity
$$nodesim(a,b) = (1-\alpha)\,localSim(a,b) + \alpha\,structSim(a,b)$$
– localSim(a,b) measures the similarity of node names and their tag distributions
– structSim(a,b) measures the similarity of leaves
– $\alpha$ is a weight for adjusting contributions from localSim and structSim ($0 \le \alpha \le 1$)
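The weighted combination on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the combined score is (1 - alpha) * localSim + alpha * structSim, takes the two component scores as already computed numbers, and uses hypothetical names (`node_sim`, `should_cluster`) and a caller-supplied threshold that the slides do not specify.

```python
def node_sim(local_sim: float, struct_sim: float, alpha: float) -> float:
    """Blend local and structural similarity; alpha must lie in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * local_sim + alpha * struct_sim


def should_cluster(local_sim: float, struct_sim: float,
                   alpha: float, threshold: float) -> bool:
    """Two nodes are clustered when the blended score exceeds the threshold."""
    return node_sim(local_sim, struct_sim, alpha) > threshold


# alpha = 0 uses only local similarity, alpha = 1 only structural similarity.
score = node_sim(local_sim=0.8, struct_sim=0.4, alpha=0.25)
```

A small alpha makes names and tags dominate the decision, while a large alpha lets the shape of the two saplings dominate.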
Learning Folksonomy - terms
A personal hierarchy is called a sapling
A sapling is composed of a root node $r^i$ and its leaves $l^i_1, \ldots, l^i_j$
Tag statistic of a node $x$ is defined as
$$\tau_x := \{(t_1, f_{t_1}), (t_2, f_{t_2}), \ldots, (t_k, f_{t_k})\}$$
Where $t_k$ and $f_{t_k}$ are a tag and its frequency
$\tau_{r^i}$ is aggregated from all $\tau_{l^i_j}$
$\tau_{l^{vic}_1}$ = {(canada, 1), (vacation, 1), … }
$\tau_{r^{vic}}$ = {(bc,1), (canada, 2), (vacation, 1), … }
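As a rough illustration of how a root's tag statistic could be aggregated from its leaves, the sketch below sums per-leaf tag frequencies with `collections.Counter`. The first leaf mirrors the slide's example; the second leaf and the function name `aggregate_tag_stats` are hypothetical and exist only to make the example complete.

```python
from collections import Counter


def aggregate_tag_stats(leaf_stats):
    """Sum the tag-frequency tables of all leaves into one table for the root."""
    root = Counter()
    for stats in leaf_stats:
        root.update(stats)  # adds frequencies tag by tag
    return root


leaf_1 = Counter({"canada": 1, "vacation": 1})  # from the slide's example
leaf_2 = Counter({"bc": 1, "canada": 1})        # hypothetical second leaf
root_stats = aggregate_tag_stats([leaf_1, leaf_2])
```

With these two leaves the root ends up with canada counted twice, matching the kind of root table shown on the slide.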
Local Similarity
The local similarity of nodes a and b is composed of
Name similarity
Tag distribution similarity
$$localSim(a,b) = \beta\,nameSim(a,b) + (1-\beta)\,tagSim(a,b)$$
$\beta$ is a weight between 0 and 1 that balances name similarity against tag similarity
nameSim can be any string similarity metric that returns a value from 0 to 1
tagSim can be any function for measuring the similarity of distributions. In this paper, tagSim counts the number of common tags n in the top K tags of a and b. If n ≥ J (a certain threshold), it returns 1; otherwise it returns n/J.
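The tagSim rule described above (count common tags among the top K, saturate at J) is easy to sketch. In the code below, the defaults K = 40 and J = 4 come from the evaluation slide; the use of `difflib.SequenceMatcher` for nameSim is an assumption (the slides only say "any string similarity metric"), and the default `beta=0.5` is an arbitrary placeholder, not a value confirmed by the source.

```python
from collections import Counter
from difflib import SequenceMatcher


def top_k_tags(stats: Counter, k: int) -> set:
    """Names of the k most frequent tags in a tag-frequency table."""
    return {tag for tag, _ in stats.most_common(k)}


def tag_sim(stats_a: Counter, stats_b: Counter, k: int = 40, j: int = 4) -> float:
    """n common tags among the top k of each node: 1 if n >= j, else n / j."""
    n = len(top_k_tags(stats_a, k) & top_k_tags(stats_b, k))
    return 1.0 if n >= j else n / j


def name_sim(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (an assumption, not the paper's choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def local_sim(name_a, stats_a, name_b, stats_b, beta=0.5, k=40, j=4):
    """Weighted mix of name similarity and tag-distribution similarity."""
    return beta * name_sim(name_a, name_b) + (1 - beta) * tag_sim(stats_a, stats_b, k, j)


a = Counter({"canada": 3, "bc": 2, "vacation": 1})
b = Counter({"canada": 1, "bc": 1, "island": 1})
example = local_sim("victoria", a, "victoria", b)
```

Here the two nodes share two top tags, so tagSim is 2/4, and identical names give nameSim of 1.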
Structural Similarity
Compute structural similarity in two cases
Similarity between two root nodes
Similarity between a root node and a leaf node
Root-to-Root Similarity
Similarity based on
How many of their children have a common name
The tag distribution similarity of those that do not have the same name
$$structSim_{RR}(r^A, r^B) = CL + (1 - CL)\,tagSim(L^A_{tag}, L^B_{tag})$$
– Where $CL = \frac{1}{Z}\sum_{i,j}\delta(name(l^A_i), name(l^B_j))$ and $Z = \min(|l^A|, |l^B|)$
– $\delta(x, y)$ is 1 if the two names match and 0 otherwise
– $L^A_{tag}$ is an aggregation of tag distributions of all $l^A_i$
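A minimal sketch of the root-to-root score follows, under several stated assumptions: the score is CL + (1 - CL) * tagSim, where CL is the number of child names shared by both roots divided by the smaller child count; leaf names are unique within a sapling; and, following the second bullet, only the leaves without a shared name contribute to the aggregated tag distributions. The data layout (a dict from leaf name to `Counter`) and all function names are hypothetical.

```python
from collections import Counter


def tag_sim(stats_a, stats_b, k=40, j=4):
    """Common tags among the top k of each side, saturating at j."""
    top_a = {t for t, _ in stats_a.most_common(k)}
    top_b = {t for t, _ in stats_b.most_common(k)}
    n = len(top_a & top_b)
    return 1.0 if n >= j else n / j


def struct_sim_rr(leaves_a, leaves_b, k=40, j=4):
    """leaves_*: dict mapping leaf name -> Counter of tag frequencies."""
    if not leaves_a or not leaves_b:
        return 0.0
    common = set(leaves_a) & set(leaves_b)   # children with the same name
    z = min(len(leaves_a), len(leaves_b))
    cl = len(common) / z
    # Aggregate tags of the leaves that do NOT share a name (an assumption).
    rest_a = sum((s for n, s in leaves_a.items() if n not in common), Counter())
    rest_b = sum((s for n, s in leaves_b.items() if n not in common), Counter())
    return cl + (1 - cl) * tag_sim(rest_a, rest_b, k, j)


animal_a = {
    "fish": Counter({"water": 1, "pet": 1}),
    "cat": Counter({"pet": 3}),
    "dog": Counter({"pet": 2, "puppy": 1}),
}
animal_b = {
    "reptile": Counter({"pet": 1, "lizard": 1}),
    "cat": Counter({"kitten": 2}),
}
rr = struct_sim_rr(animal_a, animal_b)
```

In this toy case one of the two children of the smaller sapling is shared (CL = 0.5), and the remaining leaves share one top tag, so the tag term adds a further 0.5 * 1/4.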
Root-to-Leaf Similarity
Takes into account the case where a sapling contains the same leaf as the compared node
When merging UK->{Scotland, Glasgow, Edinburgh, London} and Scotland->{Glasgow, Shetland}, it can be said that the UK and Scotland saplings are similar because UK contains a shortcut to Glasgow
Measures the overlap between the siblings of $l^A_i$ and the children of $r^B$
$$structSim_{LR}(l^A_i, r^B) = structSim_{RR}(r^A, r^B)$$
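The UK/Scotland example can be checked with a small sketch that computes only the name-overlap part of the comparison (shared child names divided by the smaller child count) and ignores tag distributions entirely. The helper name `name_overlap` is hypothetical.

```python
def name_overlap(children_a, children_b):
    """Fraction of shared child names, relative to the smaller child list."""
    a, b = set(children_a), set(children_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


uk_children = ["Scotland", "Glasgow", "Edinburgh", "London"]
scotland_children = ["Glasgow", "Shetland"]

# Glasgow appears under both roots, so one of Scotland's two children is shared.
overlap = name_overlap(uk_children, scotland_children)
```

The shared Glasgow leaf is what signals that the Scotland sapling may belong under the Scotland leaf of the UK sapling.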
Relational Clustering
When nodesim(a,b) exceeds a certain threshold, the two nodes are clustered (merged)
$$nodesim(a,b) = (1-\alpha)\,localSim(a,b) + \alpha\,structSim(a,b)$$
[Figure: animal->{fish, cat, dog} and animal->{reptile, cat} are merged into animal->{fish, cat, dog, reptile}]
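The merge step for two roots might look like the sketch below. It assumes a sapling is a `(root_name, leaf_names)` pair, that the similarity score has already been computed elsewhere, and that merging means taking the union of the leaves. The threshold default of 0.5 and the function name are hypothetical.

```python
def merge_if_similar(sapling_a, sapling_b, similarity, threshold=0.5):
    """Merge two saplings when their similarity exceeds the threshold.

    Returns the merged (root, leaves) pair, or None when no merge happens.
    """
    if similarity <= threshold:
        return None
    root, leaves_a = sapling_a
    _, leaves_b = sapling_b
    # Ordered union: keep first occurrences, drop duplicate leaf names.
    merged = list(dict.fromkeys(leaves_a + leaves_b))
    return (root, merged)


first = ("animal", ["fish", "cat", "dog"])
second = ("animal", ["reptile", "cat"])
merged = merge_if_similar(first, second, similarity=0.8)
```

With the two animal saplings from the figure, the shared cat leaf is kept once and reptile is added.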
SAP: Growing a Tree
SAP grows the tree using incremental relational clustering
[Figure: saplings rooted at canada and at victoria, combined by horizontal aggregation and vertical aggregation]
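Vertical aggregation, attaching a sapling beneath a matching leaf of the growing tree, could be sketched as follows. This is a deliberate simplification: it matches root and leaf by exact name, whereas the approach in the slides decides by a similarity score. The tree layout (a dict from node name to child names) and the function name are hypothetical.

```python
def attach_vertically(tree, sapling):
    """Hang a sapling's leaves under the tree leaf that matches its root.

    tree: dict mapping node name -> list of child names.
    sapling: (root_name, leaf_names).
    Returns a new tree; the input tree is left unchanged.
    """
    root, leaves = sapling
    appears_as_child = any(root in children for children in tree.values())
    if not appears_as_child:
        return tree  # no matching leaf, nothing to attach
    grown = {node: list(children) for node, children in tree.items()}
    existing = grown.setdefault(root, [])
    for leaf in leaves:
        if leaf not in existing:
            existing.append(leaf)
    return grown


tree = {"animal": ["fish", "cat", "dog"]}
grown = attach_vertically(tree, ("dog", ["beagle", "jindo"]))
```

Growing the tree one sapling at a time in this way is what makes the procedure incremental.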
Handling Problems
Shortcuts
Take the longer route
Noisy vocabularies
If a name is used by fewer than 1% of the users who contributed to the sapling, it is considered noisy and discarded
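The 1% rule for noisy names can be sketched as a simple filter. The sketch assumes that, for a merged sapling, we know which users contributed each child name; that bookkeeping structure (`name_to_users`) and the function name are hypothetical.

```python
def drop_noisy_names(name_to_users, min_fraction=0.01):
    """Keep only names used by at least min_fraction of contributing users.

    name_to_users: dict mapping child name -> set of user ids that used it.
    """
    all_users = set().union(*name_to_users.values())
    total = len(all_users)
    return {
        name: users
        for name, users in name_to_users.items()
        if len(users) / total >= min_fraction
    }


contributions = {
    "dog": set(range(0, 150)),
    "cat": set(range(100, 200)),
    "mykid": {0},  # used by 1 of 200 contributors, below the 1% cutoff
}
kept = drop_noisy_names(contributions)
```

Idiosyncratic names such as mykid are dropped this way, while names with broad support survive.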
Evaluation
20,759 saplings from 7,121 users are selected from Flickr
Values for several parameters are chosen
Parameter          Value
K                  40
J                  4
$\alpha_{RR}$      0.1
$\alpha_{LR}$      0.8
[symbol missing]   0.5
Evaluation Methodologies
Against the reference hierarchy (ODP)
Taxonomic Overlap (fmTO): measures the structural similarity between two trees. For each node, it measures how many of its ancestor and descendant nodes overlap with those in the reference tree, and the harmonic mean is reported
Lexical Recall (LR): measures how well an approach can discover the concepts in the reference hierarchy
Structural evaluation
Area Under Tree – a measurement to define how bushy and deep a tree is (AUT)
Manual evaluation (for concepts not in ODP)
Asking 5 judges whether the path is correct
Evaluation metric - AUT
AUT for the tree below is the area under its nodes-per-depth curve:
1*(3+1)/2 + 1*(3+2)/2 = 4.5
[Figure: tree animal->{fish, cat, dog->{beagle, jindo}}, plotted as nodes per depth: 1 node at depth 0, 3 at depth 1, 2 at depth 2]
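The arithmetic on this slide is a trapezoid sum over the number of nodes at each depth, which the following sketch reproduces. It assumes unit spacing between depth levels and a tree stored as a dict from node name to child names; both function names are hypothetical.

```python
def nodes_per_depth(tree, root):
    """Count nodes at each depth, starting from the root at depth 0."""
    counts = []
    level = [root]
    while level:
        counts.append(len(level))
        level = [child for node in level for child in tree.get(node, [])]
    return counts


def area_under_tree(counts):
    """Trapezoid-rule area under the nodes-per-depth curve (unit depth steps)."""
    return sum((a + b) / 2 for a, b in zip(counts, counts[1:]))


tree = {"animal": ["fish", "cat", "dog"], "dog": ["beagle", "jindo"]}
counts = nodes_per_depth(tree, "animal")  # [1, 3, 2]
aut = area_under_tree(counts)             # (1+3)/2 + (3+2)/2 = 4.5
```

A bushier or deeper tree adds more or larger trapezoids, so its AUT grows.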
Baseline of comparison
SIG: the authors' earlier approach
Assumes nodes with the same name refer to the same concept
Keeps the relations between node pairs
Combines all relations into a tree
Conclusion & Discussion
This work can create more accurate and more detailed folksonomies than the current state-of-the-art approach by exploiting structural information
This work is more scalable – it incrementally grows the folksonomies
Lacks comparison with other state-of-the-art approaches