Andrew Clegg Data Natives, Berlin, 2016
Semantic Similarity and Taxonomic Distance:
Using Structured Metadata in Data Science Models
Semantic Similarity:
Some Uses
True Label      Prediction / Score
digital_prints  music_and_movie_posters (0.84)
digital_prints  digital_prints (1.00)
digital_prints  lithographs (0.79)
lens_cases      lens_cases (1.00)
lens_cases      camera_cases (0.92)
lens_cases      laptop_bags (0.77)
[Figure: two panels, "Before" and "After"]
Measuring Semantic Similarity
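Path Length
[Figure: example taxonomy tree with nodes All Items, Shoes, Boots, Sneakers & Athletic Shoes, Hi Tops, Sandals, Skates, Tie Sneakers]
$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = 1 - \frac{\mathrm{len}(\mathrm{node}_1, \mathrm{node}_2)}{2 \times \mathrm{max\_depth}}$$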
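A minimal sketch of the path-length measure, assuming a toy version of the shoes taxonomy. The parent/child links in `PARENT` are an assumption read off the node labels on the slide, the root is taken to be depth 0, and all function names are hypothetical.

```python
# Toy taxonomy as child -> parent links (assumed structure, not from the talk).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}


def chain_to_root(node):
    """Return the list of nodes from `node` up to the root, `node` first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Number of edges on the path between a and b via their lowest common ancestor."""
    up_a, up_b = chain_to_root(a), chain_to_root(b)
    common = next(n for n in up_a if n in up_b)
    return up_a.index(common) + up_b.index(common)


# Deepest node in the toy tree, counting the root as depth 0.
MAX_DEPTH = max(len(chain_to_root(n)) - 1 for n in PARENT)


def path_similarity(a, b):
    """sim = 1 - len(a, b) / (2 * max_depth)."""
    return 1 - path_length(a, b) / (2 * MAX_DEPTH)


print(path_similarity("Boots", "Tie Sneakers"))  # 3 edges apart, max depth 3
```

Under these assumptions Boots and Tie Sneakers are three edges apart, so the similarity is 1 - 3/6 = 0.5, while two siblings such as Hi Tops and Skates score about 0.67.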
Node Depth
home_and_living     1
kitchen_and_dining  2
cookware            3
pots_and_pans       4
pans                5
skillets            6
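Wu & Palmer 1994
[Figure: the same example taxonomy tree, from All Items down to Tie Sneakers]
$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = \frac{2 \times \mathrm{depth}(\mathrm{ancestor})}{\mathrm{len}(\mathrm{node}_1, \mathrm{node}_2) + 2 \times \mathrm{depth}(\mathrm{ancestor})}$$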
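A rough sketch of the Wu & Palmer measure on the same kind of toy taxonomy. Here "ancestor" is read as the lowest common ancestor of the two nodes. The tree structure, the depth-0 root, and the function names are all assumptions made for illustration.

```python
# Toy taxonomy as child -> parent links (assumed structure, not from the talk).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}


def chain_to_root(node):
    """Return the list of nodes from `node` up to the root, `node` first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """Depth of a node, with the root (All Items) at depth 0."""
    return len(chain_to_root(node)) - 1


def lowest_common_ancestor(a, b):
    """Deepest node that lies on both root paths."""
    up_b = set(chain_to_root(b))
    return next(n for n in chain_to_root(a) if n in up_b)


def wu_palmer(a, b):
    """sim = 2 * depth(lca) / (len(a, b) + 2 * depth(lca))."""
    if a == b:
        return 1.0  # avoids 0/0 when both arguments are the root
    lca = lowest_common_ancestor(a, b)
    path_len = depth(a) + depth(b) - 2 * depth(lca)
    return 2 * depth(lca) / (path_len + 2 * depth(lca))


print(wu_palmer("Boots", "Tie Sneakers"))
print(wu_palmer("Hi Tops", "Skates"))
```

Unlike plain path length, the score depends on how deep the shared ancestor sits: Hi Tops and Skates meet at depth 2 and score about 0.67, while Boots and Tie Sneakers meet at Shoes (depth 1) and score 0.4.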
Sussna 1993
[Figure: the same example taxonomy tree, with edges labelled 0.17, 0.17 and 0.11]
$$\mathrm{dist}(\mathrm{parent}, \mathrm{child}) = \frac{1 - \frac{1}{\mathrm{num\_children}(\mathrm{parent})}}{2 \times \mathrm{depth}(\mathrm{child})}$$
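The edge weight above can be sketched as follows. This assumes Shoes and Sneakers & Athletic Shoes each have three children and that All Items is depth 0; both are guesses about the figure, not statements from the talk. Function names are hypothetical.

```python
# Toy taxonomy as child -> parent links (assumed structure, not from the talk).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}


def depth(node):
    """Depth of a node, with the root (All Items) at depth 0."""
    d = 0
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d


def num_children(parent):
    """Count the direct children of `parent`."""
    return sum(1 for p in PARENT.values() if p == parent)


def sussna_edge(parent, child):
    """dist = (1 - 1 / num_children(parent)) / (2 * depth(child))."""
    if PARENT.get(child) != parent:
        raise ValueError(f"{child!r} is not a direct child of {parent!r}")
    return (1 - 1 / num_children(parent)) / (2 * depth(child))


print(round(sussna_edge("Shoes", "Boots"), 2))
print(round(sussna_edge("Sneakers & Athletic Shoes", "Hi Tops"), 2))
```

With these assumptions the two calls give 0.17 and 0.11, in line with the edge labels on the slide: edges get shorter deeper in the tree, and longer when the parent has many children.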
Information-Based Methods
Information Content
$$I(\mathrm{node}) = -\log P(\mathrm{node})$$
P(node): How frequent is this node or any of its descendants in your data?
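One way to estimate P(node) is to count items filed under each node and roll the counts up to every ancestor. The sketch below does that with made-up counts (`ITEM_COUNTS` is hypothetical, chosen so that P(shoes) = 0.14), an assumed tree structure, and base-2 logarithms, which is the base the figures quoted later in the deck appear to use.

```python
import math
from collections import Counter

# Toy taxonomy as child -> parent links (assumed structure, not from the talk).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}

# Hypothetical numbers of items filed directly under each node.
# "All Items" stands in for everything outside the Shoes subtree.
ITEM_COUNTS = {
    "All Items": 86,
    "Boots": 4,
    "Sandals": 3,
    "Sneakers & Athletic Shoes": 1,
    "Hi Tops": 2,
    "Skates": 1,
    "Tie Sneakers": 3,
}


def subtree_counts(item_counts):
    """Add each node's own count to itself and to all of its ancestors."""
    totals = Counter()
    for node, count in item_counts.items():
        while True:
            totals[node] += count
            if node not in PARENT:
                break
            node = PARENT[node]
    return totals


def information_content(node, totals, root="All Items"):
    """I(node) = -log2 P(node), with P estimated from subtree counts."""
    return -math.log2(totals[node] / totals[root])


totals = subtree_counts(ITEM_COUNTS)
for name in ("All Items", "Shoes", "Boots", "Tie Sneakers"):
    print(name, totals[name], round(information_content(name, totals), 3))
```

The root always has P = 1 and therefore zero information content; rarer, more specific nodes score higher.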
Resnik 1995
[Figure: the same example taxonomy tree, from All Items down to Tie Sneakers]
$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = -\log P(\mathrm{ancestor})$$
P(shoes) = 0.14, so -log P(shoes) = 2.83
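A small sketch of the Resnik measure, using P(shoes) = 0.14 from the slide. It assumes the logarithm is base 2 (which roughly reproduces the 2.83 shown), that "ancestor" means the lowest common ancestor, and that the tree is shaped as in `PARENT` below. The function names are hypothetical.

```python
import math

# Toy taxonomy as child -> parent links (assumed structure, not from the talk).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}

# P(Shoes) is taken from the slide; the root has P = 1 by definition.
PROB = {"All Items": 1.0, "Shoes": 0.14}


def chain_to_root(node):
    """Return the list of nodes from `node` up to the root, `node` first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_common_ancestor(a, b):
    """Deepest node that lies on both root paths."""
    up_b = set(chain_to_root(b))
    return next(n for n in chain_to_root(a) if n in up_b)


def resnik(a, b):
    """sim = -log2 P(lowest common ancestor)."""
    return -math.log2(PROB[lowest_common_ancestor(a, b)])


print(round(resnik("Boots", "Tie Sneakers"), 3))
```

Every pair of nodes whose lowest common ancestor is Shoes gets the same score here, because the measure looks only at the shared ancestor and not at the nodes themselves.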
Lin 1998
[Figure: the same example taxonomy tree, from All Items down to Tie Sneakers]
$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = \frac{2 \times \log P(\mathrm{ancestor})}{\log P(\mathrm{node}_1) + \log P(\mathrm{node}_2)}$$
P(shoes) = 0.14, so -log P(shoes) = 2.83
P(boots) = 0.04, so -log P(boots) = 4.64
P(tie sneakers) = 0.03, so -log P(tie sneakers) = 5.06
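The Lin measure can be worked through with the three probabilities on the slide. This sketch assumes Shoes is the lowest common ancestor of Boots and Tie Sneakers, which matches the node labels but is not stated explicitly. The function name and its `ancestor` parameter are hypothetical.

```python
import math

# Probabilities quoted on the slide.
PROB = {"Shoes": 0.14, "Boots": 0.04, "Tie Sneakers": 0.03}


def lin(a, b, ancestor):
    """sim = 2 * log P(ancestor) / (log P(a) + log P(b)).

    `ancestor` is the lowest common ancestor of a and b, passed in
    explicitly to keep the sketch short.
    """
    denominator = math.log2(PROB[a]) + math.log2(PROB[b])
    if denominator == 0:
        return 1.0  # both nodes have P = 1, i.e. both are the root
    return 2 * math.log2(PROB[ancestor]) / denominator


score = lin("Boots", "Tie Sneakers", ancestor="Shoes")
print(round(score, 3))
```

With the slide's numbers the score comes out at roughly 0.58. Because the measure is a ratio of logarithms, the choice of log base cancels out, and a node compared with itself always scores 1.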
Which Method Wins?
Thanks!
See this paper for all the references: Budanitsky & Hirst, Computational Linguistics 32 (1), 2006.
Find me on Twitter: @andrew_clegg
PS We’re hiring! https://www.etsy.com/careers/