"semantic similarity & taxonomic distance: using structured metadata in data science...

Andrew Clegg Data Natives, Berlin, 2016

Semantic Similarity and Taxonomic Distance: Using Structured Metadata in Data Science Models

Semantic Similarity: Some Uses

True Label        Prediction                 Score
digital_prints    music_and_movie_posters    0.84
digital_prints    digital_prints             1.00
digital_prints    lithographs                0.79
lens_cases        lens_cases                 1.00
lens_cases        camera_cases               0.92
lens_cases        laptop_bags                0.77

Before After

Measuring Semantic Similarity


Path Length

All Items
  Shoes
    Boots
    Sneakers & Athletic Shoes
      Hi Tops
      Skates
      Tie Sneakers
    Sandals

$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = 1 - \frac{\mathrm{len}(\mathrm{node}_1, \mathrm{node}_2)}{2 \times \mathrm{max\_depth}}$$
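The path-length formula can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the `PARENT` map is an assumed reading of the tree on the slide, the helper names are hypothetical, and `max_depth` is taken to be the deepest level of this toy tree, with "All Items" at depth 0.

```python
# Child -> parent links for the toy taxonomy (the tree shape is an assumption).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Edges from the root, so "All Items" is 0 and "Shoes" is 1."""
    return len(path_to_root(node)) - 1


def path_length(node1, node2):
    """Number of edges on the shortest path between two nodes."""
    up1, up2 = path_to_root(node1), path_to_root(node2)
    ancestor = next(n for n in up1 if n in up2)  # lowest common ancestor
    return up1.index(ancestor) + up2.index(ancestor)


MAX_DEPTH = max(depth(n) for n in PARENT)


def path_similarity(node1, node2):
    return 1 - path_length(node1, node2) / (2 * MAX_DEPTH)


print(path_similarity("Boots", "Tie Sneakers"))    # 3 edges apart
print(path_similarity("Hi Tops", "Tie Sneakers"))  # 2 edges apart
```

With this toy tree, Boots and Tie Sneakers are three edges apart and the maximum depth is 3, so their similarity is 1 - 3/6 = 0.5. Siblings such as Hi Tops and Tie Sneakers score higher because only two edges separate them.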

Node Depth

home_and_living       1
kitchen_and_dining    2
cookware              3
pots_and_pans         4
pans                  5
skillets              6
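As a small illustration of the depth convention above (top-level categories at depth 1), the sketch below numbers the categories along one root-to-leaf path. The function name and the list representation of the path are assumptions made for the example.

```python
def depths_along_path(path):
    """Map each category on a root-to-leaf path to its depth (top level = 1)."""
    return {name: level for level, name in enumerate(path, start=1)}


path = [
    "home_and_living",
    "kitchen_and_dining",
    "cookware",
    "pots_and_pans",
    "pans",
    "skillets",
]
depths = depths_along_path(path)
print(depths["skillets"])  # 6
```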

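Wu & Palmer 1994

[Figure: the same Shoes taxonomy tree]

$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = \frac{2 \times \mathrm{depth}(\mathrm{ancestor})}{\mathrm{len}(\mathrm{node}_1, \mathrm{node}_2) + 2 \times \mathrm{depth}(\mathrm{ancestor})}$$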
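A minimal sketch of the Wu & Palmer formula as written on the slide, under the same assumptions as the toy Shoes tree: the `PARENT` map is an assumed reading of the figure, the function names are hypothetical, "ancestor" is taken to mean the lowest common ancestor, and "All Items" has depth 0.

```python
# Child -> parent links for the toy taxonomy (the tree shape is an assumption).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def wu_palmer(node1, node2):
    up1, up2 = path_to_root(node1), path_to_root(node2)
    ancestor = next(n for n in up1 if n in up2)  # lowest common ancestor
    length = up1.index(ancestor) + up2.index(ancestor)
    ancestor_depth = len(path_to_root(ancestor)) - 1
    denominator = length + 2 * ancestor_depth
    # The root compared with itself would give 0/0; treat it as fully similar.
    return 1.0 if denominator == 0 else 2 * ancestor_depth / denominator


print(wu_palmer("Boots", "Tie Sneakers"))    # ancestor is Shoes (depth 1)
print(wu_palmer("Hi Tops", "Tie Sneakers"))  # ancestor is at depth 2
```

Unlike plain path length, the score depends on how deep the shared ancestor sits: Boots and Tie Sneakers give 2 / (3 + 2) = 0.4, while the two sneaker types give 4 / (2 + 4), about 0.67.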

Sussna 1993

[Figure: the same Shoes taxonomy tree, with edge weights 0.17, 0.17 and 0.11]

$$\mathrm{dist}(\mathrm{parent}, \mathrm{child}) = \frac{1 - 1 / \mathrm{num\_children}(\mathrm{parent})}{2 \times \mathrm{depth}(\mathrm{child})}$$
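The edge-weight formula can be checked against the toy Shoes tree. The sketch below is an illustration under assumptions: the `PARENT` map is an assumed reading of the figure (three children under Shoes, three under Sneakers & Athletic Shoes), the function names are hypothetical, and the distance between two nodes is taken to be the sum of the edge weights on the path between them.

```python
from collections import Counter

# Child -> parent links for the toy taxonomy (the tree shape is an assumption).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}
NUM_CHILDREN = Counter(PARENT.values())


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def edge_weight(child):
    """Weight of the edge between `child` and its parent."""
    parent = PARENT[child]
    child_depth = len(path_to_root(child)) - 1  # "All Items" has depth 0
    return (1 - 1 / NUM_CHILDREN[parent]) / (2 * child_depth)


def sussna_distance(node1, node2):
    """Sum the edge weights from each node up to the lowest common ancestor."""
    up1, up2 = path_to_root(node1), path_to_root(node2)
    ancestor = next(n for n in up1 if n in up2)
    below = up1[: up1.index(ancestor)] + up2[: up2.index(ancestor)]
    return sum(edge_weight(n) for n in below)


for child in ["Boots", "Sneakers & Athletic Shoes", "Tie Sneakers"]:
    print(child, round(edge_weight(child), 2))
print(sussna_distance("Boots", "Tie Sneakers"))
```

Under this reading of the tree, the three edges between Boots and Tie Sneakers come out at roughly 0.17, 0.17 and 0.11, which agrees with the weights shown on the slide. Edges deeper in the tree weigh less, so moving between fine-grained categories costs less than moving between broad ones.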

Information-Based Methods

Information Content

$$I(\mathrm{node}) = -\log P(\mathrm{node})$$

P(node): how frequent is this node or any of its descendants in your data?
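One way to estimate P(node) is to count the items filed under a node or any of its descendants and divide by the total. The sketch below assumes that approach. The item counts are invented for illustration (chosen so that P(Shoes) comes out at 0.14), the "Everything Else" node is a hypothetical stand-in for the rest of the catalogue, and base-2 logarithms are an assumption.

```python
import math
from collections import Counter

# Child -> parent links (the tree shape and "Everything Else" are assumptions).
PARENT = {
    "Shoes": "All Items",
    "Everything Else": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}

# Hypothetical number of items filed directly under each leaf category.
ITEM_COUNTS = {
    "Boots": 4,
    "Sandals": 5,
    "Hi Tops": 1,
    "Skates": 1,
    "Tie Sneakers": 3,
    "Everything Else": 86,
}


def subtree_counts(item_counts):
    """Credit each item to its own node and to every ancestor of that node."""
    totals = Counter()
    for node, count in item_counts.items():
        while True:
            totals[node] += count
            if node not in PARENT:
                break
            node = PARENT[node]
    return totals


TOTALS = subtree_counts(ITEM_COUNTS)


def probability(node):
    return TOTALS[node] / TOTALS["All Items"]


def information_content(node):
    return -math.log2(probability(node))


print(probability("Shoes"), information_content("Shoes"))
print(probability("Boots"), information_content("Boots"))
```

Rare, specific categories carry more information than broad ones: here Boots (4 items in 100) scores about 4.64 bits, while Shoes (14 in 100) scores about 2.84, and the root scores zero.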

Resnik 1995

[Figure: the same Shoes taxonomy tree]

$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = -\log P(\mathrm{ancestor})$$

P(shoes) = 0.14    -log P(shoes) = 2.83
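A minimal sketch of Resnik similarity on the toy Shoes tree. The `PARENT` map is an assumed reading of the figure and "ancestor" is taken to mean the lowest common ancestor. Only the probabilities for Shoes, Boots and Tie Sneakers come from the slides; the other values are made up for illustration. Base-2 logarithms are an assumption, chosen because they approximately reproduce the figures on the slides.

```python
import math

# Child -> parent links for the toy taxonomy (the tree shape is an assumption).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}

# P(node): share of items under the node or any of its descendants.
P = {
    "All Items": 1.0,
    "Shoes": 0.14,                      # from the slide
    "Boots": 0.04,                      # from the slides
    "Tie Sneakers": 0.03,               # from the slides
    "Sneakers & Athletic Shoes": 0.05,  # hypothetical
    "Sandals": 0.05,                    # hypothetical
    "Hi Tops": 0.01,                    # hypothetical
    "Skates": 0.01,                     # hypothetical
}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def resnik(node1, node2):
    """Information content of the lowest common ancestor."""
    up2 = path_to_root(node2)
    ancestor = next(n for n in path_to_root(node1) if n in up2)
    return -math.log2(P[ancestor])


print(resnik("Boots", "Tie Sneakers"))    # ancestor is Shoes
print(resnik("Hi Tops", "Tie Sneakers"))  # ancestor is more specific
```

For Boots and Tie Sneakers the shared ancestor is Shoes, so the score is -log2(0.14), about 2.84. The slide shows 2.83, presumably computed from an unrounded probability. Note that the score ignores how far each node sits below the ancestor, so every pair meeting at Shoes gets the same value.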

Lin 1998

[Figure: the same Shoes taxonomy tree]

$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = \frac{2 \times \log P(\mathrm{ancestor})}{\log P(\mathrm{node}_1) + \log P(\mathrm{node}_2)}$$

P(shoes) = 0.14           -log P(shoes) = 2.83
P(boots) = 0.04           -log P(boots) = 4.64
P(tie sneakers) = 0.03    -log P(tie sneakers) = 5.06
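Plugging the slide's three probabilities into the Lin formula gives a quick worked example. The function name and signature are hypothetical, and the sketch assumes Shoes is the common ancestor of Boots and Tie Sneakers.

```python
import math


def lin_similarity(p_ancestor, p_node1, p_node2):
    """Lin similarity from the probabilities of the ancestor and both nodes."""
    return 2 * math.log(p_ancestor) / (math.log(p_node1) + math.log(p_node2))


# Probabilities taken from the slide: shoes, boots, tie sneakers.
score = lin_similarity(p_ancestor=0.14, p_node1=0.04, p_node2=0.03)
print(round(score, 2))
```

The result is about 0.58. Because the formula is a ratio of logarithms, the choice of log base cancels out. The score is 1 when both nodes are the same node, and it falls toward 0 as the common ancestor becomes more general relative to the two nodes.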

Which Method Wins?


Thanks!

See this paper for all the references: Budanitsky & Hirst, Computational Linguistics 32 (1), 2006.

Find me on Twitter: @andrew_clegg

PS We’re hiring! https://www.etsy.com/careers/

