Andrew Clegg Data Natives, Berlin, 2016


Page 1: "Semantic Similarity & Taxonomic Distance: Using Structured Metadata in Data Science Models", Andrew Clegg, Data Scientist at Etsy

Page 2

Semantic Similarity and Taxonomic Distance: Using Structured Metadata in Data Science Models

Page 7

Semantic Similarity: Some Uses

Page 8

True Label      Prediction (Score)
digital_prints  music_and_movie_posters (0.84)
digital_prints  digital_prints (1.00)
digital_prints  lithographs (0.79)
lens_cases      lens_cases (1.00)
lens_cases      camera_cases (0.92)
lens_cases      laptop_bags (0.77)
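One possible use of scores like these is to give a classifier partial credit for near misses. The sketch below is an assumption about how that could be done, not something stated on the slide. It takes the six rows from the table and compares exact-match accuracy with the mean similarity score. The names `rows`, `exact_accuracy` and `mean_similarity` are illustrative.

```python
# Rows copied from the table on this slide: (true label, prediction, similarity score).
rows = [
    ("digital_prints", "music_and_movie_posters", 0.84),
    ("digital_prints", "digital_prints", 1.00),
    ("digital_prints", "lithographs", 0.79),
    ("lens_cases", "lens_cases", 1.00),
    ("lens_cases", "camera_cases", 0.92),
    ("lens_cases", "laptop_bags", 0.77),
]

# Exact-match accuracy treats every wrong label as equally wrong.
exact_accuracy = sum(true == pred for true, pred, _ in rows) / len(rows)

# Mean similarity gives partial credit to predictions that are close in the taxonomy.
mean_similarity = sum(score for _, _, score in rows) / len(rows)

print(f"exact-match accuracy: {exact_accuracy:.2f}")
print(f"mean similarity:      {mean_similarity:.2f}")
```

Under this reading, a prediction of camera_cases for a lens_cases item counts as nearly right rather than simply wrong.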

Page 9

[Figure with two panels: Before, After]

Page 10

Measuring Semantic Similarity

Page 11

Path Length

All Items
    Shoes
        Boots
        Sneakers & Athletic Shoes
            Hi Tops
            Skates
            Tie Sneakers
        Sandals

$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = 1 - \frac{\mathrm{len}(\mathrm{node}_1, \mathrm{node}_2)}{2 \times \mathrm{max\_depth}}$$
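A minimal sketch of the path-length measure on the shoe taxonomy from this slide. The `PARENT` map, the helper names, and the choice of `MAX_DEPTH = 3` (root at depth 0, deepest node shown at depth 3) are assumptions made for illustration. In a real taxonomy the maximum depth would be taken over the whole tree.

```python
# Child -> parent map for the shoe taxonomy on this slide (illustrative encoding).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}

MAX_DEPTH = 3  # assumption: root at depth 0, deepest node shown at depth 3


def ancestors(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def path_length(a, b):
    """Number of edges on the path between a and b through their lowest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = next(n for n in up_a if n in up_b)  # lowest common ancestor
    return up_a.index(common) + up_b.index(common)


def path_similarity(a, b, max_depth=MAX_DEPTH):
    """1 - len(a, b) / (2 * max_depth), as in the formula above."""
    return 1 - path_length(a, b) / (2 * max_depth)


print(path_similarity("Boots", "Tie Sneakers"))
print(path_similarity("Hi Tops", "Tie Sneakers"))
```

Boots and Tie Sneakers are three edges apart, so they score lower than the siblings Hi Tops and Tie Sneakers, which are two edges apart.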

Page 12

Node Depth

Node                Depth
home_and_living     1
kitchen_and_dining  2
cookware            3
pots_and_pans       4
pans                5
skillets            6

Page 13

Wu & Palmer 1994

[Diagram: the shoe taxonomy tree from page 11]

$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = \frac{2 \times \mathrm{depth}(\mathrm{ancestor})}{\mathrm{len}(\mathrm{node}_1, \mathrm{node}_2) + 2 \times \mathrm{depth}(\mathrm{ancestor})}$$
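The Wu & Palmer formula can be sketched on the same shoe taxonomy. The `PARENT` map and function names are illustrative, and the sketch assumes the root ("All Items") has depth 0, which matches the Node Depth slide where a top-level category has depth 1. Under that assumption, two nodes whose only common ancestor is the root score 0.

```python
# Child -> parent map for the shoe taxonomy (illustrative encoding).
PARENT = {
    "Shoes": "All Items",
    "Boots": "Shoes",
    "Sneakers & Athletic Shoes": "Shoes",
    "Sandals": "Shoes",
    "Hi Tops": "Sneakers & Athletic Shoes",
    "Skates": "Sneakers & Athletic Shoes",
    "Tie Sneakers": "Sneakers & Athletic Shoes",
}


def chain(node):
    """Return [node, parent, ..., root]."""
    out = [node]
    while out[-1] in PARENT:
        out.append(PARENT[out[-1]])
    return out


def depth(node):
    """Edges between the node and the root (assumption: root depth is 0)."""
    return len(chain(node)) - 1


def wu_palmer(a, b):
    up_a, up_b = chain(a), chain(b)
    ancestor = next(n for n in up_a if n in up_b)  # lowest common ancestor
    length = up_a.index(ancestor) + up_b.index(ancestor)
    d = depth(ancestor)
    if length == 0 and d == 0:
        return 1.0  # the root compared with itself; avoids 0 / 0
    return 2 * d / (length + 2 * d)


print(wu_palmer("Boots", "Tie Sneakers"))
print(wu_palmer("Hi Tops", "Tie Sneakers"))
```

Unlike plain path length, the score depends on how deep the common ancestor is, so pairs that meet low in the tree are rewarded.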

Page 14

Sussna 1993

[Diagram: the shoe taxonomy tree from page 11, with edge weights 0.17, 0.17 and 0.11]

$$\mathrm{dist}(\mathrm{parent}, \mathrm{child}) = \frac{1 - \dfrac{1}{\mathrm{num\_children}(\mathrm{parent})}}{2 \times \mathrm{depth}(\mathrm{child})}$$
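As a quick check of the edge-weight formula, the sketch below plugs in values read off the shoe taxonomy: a parent with three children and a child at depth 2 or depth 3, counting the root as depth 0. These inputs are an assumption about which edges the slide annotates. They do reproduce the three weights shown, which suggests the weights belong to the path from Boots up to Shoes and down to Tie Sneakers.

```python
def sussna_edge_distance(num_children, child_depth):
    """(1 - 1 / num_children(parent)) / (2 * depth(child)), as in the formula above."""
    return (1 - 1 / num_children) / (2 * child_depth)


# Shoes has three children at depth 2 (Boots, Sneakers & Athletic Shoes, Sandals).
shoes_to_boots = sussna_edge_distance(num_children=3, child_depth=2)
shoes_to_sneakers = sussna_edge_distance(num_children=3, child_depth=2)

# Sneakers & Athletic Shoes has three children at depth 3 (Hi Tops, Skates, Tie Sneakers).
sneakers_to_tie = sussna_edge_distance(num_children=3, child_depth=3)

# Distance between two nodes is the sum of edge distances along the path joining them.
boots_to_tie_sneakers = shoes_to_boots + shoes_to_sneakers + sneakers_to_tie

print(round(shoes_to_boots, 2), round(shoes_to_sneakers, 2), round(sneakers_to_tie, 2))
print(round(boots_to_tie_sneakers, 2))
```

Edges deeper in the tree get smaller weights, so a step between two specific categories counts for less than a step between two broad ones.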

Page 15

Information-Based Methods

Page 16

Information Content

$$I(\mathrm{node}) = -\log P(\mathrm{node})$$

P(node): How frequent is this node or any of its descendants in your data?
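A minimal sketch of how P(node) and I(node) might be estimated from listing counts. The toy taxonomy, the counts, and all names below are invented for illustration and are not from the talk. The base-2 logarithm is also an assumption, chosen because it is consistent with the figures on the following slides.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and direct listing counts.
PARENT = {"Shoes": "All Items", "Bags": "All Items", "Boots": "Shoes", "Sandals": "Shoes"}
DIRECT_COUNTS = {"Shoes": 10, "Bags": 50, "Boots": 30, "Sandals": 10}


def subtree_count(node):
    """Listings filed under the node itself or under any of its descendants."""
    children = [child for child, parent in PARENT.items() if parent == node]
    return DIRECT_COUNTS.get(node, 0) + sum(subtree_count(c) for c in children)


TOTAL = subtree_count("All Items")


def probability(node):
    return subtree_count(node) / TOTAL


def information_content(node):
    # Base 2 is an assumption; any base works if it is used consistently.
    return -math.log2(probability(node))


for name in ("All Items", "Shoes", "Boots"):
    print(name, probability(name), information_content(name))
```

The root covers every listing, so its information content is zero, and rarer, more specific categories carry more information.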

Page 17

Resnik 1995

[Diagram: the shoe taxonomy tree from page 11]

$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = -\log P(\mathrm{ancestor})$$

P(shoes) = 0.14, -log P(shoes) = 2.83
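The Resnik score is just the information content of the lowest common ancestor. The sketch below uses the probability given on the slide and assumes a base-2 logarithm, since that is the base consistent with the slide's figure. From the rounded value 0.14 it gives roughly 2.84 rather than 2.83, presumably because the slide's probability was rounded for display.

```python
import math

P_SHOES = 0.14  # P(shoes) as shown on the slide


def resnik_similarity(p_ancestor):
    """-log2 of the probability of the lowest common ancestor (base 2 is an assumption)."""
    return -math.log2(p_ancestor)


# Any two nodes whose lowest common ancestor is Shoes get the same score,
# for example Boots and Tie Sneakers, or Boots and Sandals.
score = resnik_similarity(P_SHOES)
print(round(score, 2))
```

Because only the ancestor enters the formula, the measure cannot tell apart different pairs that share the same lowest common ancestor.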

Page 18

Lin 1998

[Diagram: the shoe taxonomy tree from page 11]

$$\mathrm{sim}(\mathrm{node}_1, \mathrm{node}_2) = \frac{2 \times \log P(\mathrm{ancestor})}{\log P(\mathrm{node}_1) + \log P(\mathrm{node}_2)}$$

P(shoes) = 0.14, -log P(shoes) = 2.83
P(boots) = 0.04, -log P(boots) = 4.64
P(tie sneakers) = 0.03, -log P(tie sneakers) = 5.06
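Plugging the slide's probabilities into the Lin formula gives a worked value for Boots versus Tie Sneakers, with Shoes as their lowest common ancestor. The function name and dictionary are illustrative. The base-2 logarithm is an assumption, although the ratio is the same in any base.

```python
import math

# Probabilities as shown on the slide.
P = {"shoes": 0.14, "boots": 0.04, "tie sneakers": 0.03}


def lin_similarity(p_ancestor, p_node1, p_node2):
    """2 * log P(ancestor) / (log P(node1) + log P(node2)); the log base cancels out."""
    return 2 * math.log2(p_ancestor) / (math.log2(p_node1) + math.log2(p_node2))


score = lin_similarity(P["shoes"], P["boots"], P["tie sneakers"])
print(round(score, 2))
```

The denominator normalises the Resnik score by how specific the two nodes themselves are, so the result always lies between 0 and 1, and a node compared with itself scores 1.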

Page 19

Which Method Wins?

Page 20

Thanks!

See this paper for all the references: Budanitsky & Hirst, Computational Linguistics 32 (1), 2006.

Find me on Twitter: @andrew_clegg

PS We’re hiring! https://www.etsy.com/careers/