Fang Xu - Enriching Content with Knowledge Base by Search Keywords and Wikidata

Posted on 16-Apr-2017

1

© Searchmetrics. All rights reserved. Do not distribute without permission.

Enriching content with Knowledge Base by Search Keywords and Wikidata

Fang Xu, f.xu@searchmetrics.com, @allxufang

2

Data Science @ Searchmetrics

Data-driven search and content optimization marketing

• Learning from keywords

• Content optimization

• Data visualization

3

Looooots of Data

• 120 Million Domains

• 600 Million Keywords

• 120 Billion Links

• 25,000 Billion Social Signals

• 25 PB raw data

4

Content Production in Real-time

• Authors submit content

• Rate the content's effectiveness

• Feedback to optimize and enrich it

5

Beyond keywords

• Keyword
  • Typos
  • Ambiguous
  • Sparse

• Entity
  • Augmented with metadata
  • Relations among entities

6

Q64

Entity

7

8

http://brendangriffen.com/blog/gow-programming-languages

Knowledge Base (KB)

9

KB Timeline

[Figure: timeline of knowledge bases. Recoverable labels: 2001, 2005, 2008, 2012, 2012, 2014, "Knowledge vaults".]

10

Why Wikidata

• Free collaborative KB

• Continuous evolution

• Open multilingual data

• Mapping to other KBs

11

Link content to KB

• Entity Linking: free text to entities

• Blog posts

• Tweets

• Keywords

• User-generated content

• Entities from a knowledge base
  • Wikipedia
  • Wikidata
  • Domain-specific KBs

12

Image from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM 2008

Entity Linking

13

Main Problems

• Identify important keywords to link in the text

• Link to the right entity

14

Dictionary of keywords to KB entities

Search keyword mentions in text
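A minimal sketch of dictionary-based mention search, assuming whitespace tokenization and a longest-match-first rule. The dictionary contents, the function name `find_mentions`, and the `max_len` limit are hypothetical; the slides do not say how matching is actually implemented.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword -> KB entity dictionary (illustrative entries only).
KEYWORD_TO_ENTITY: Dict[str, str] = {
    "berlin": "Berlin",
    "germany": "Germany",
    "capital city": "Capital_city",
}


def find_mentions(text: str, dictionary: Dict[str, str],
                  max_len: int = 3) -> List[Tuple[str, str]]:
    """Return (surface text, entity) pairs, preferring the longest match."""
    tokens = text.lower().split()
    mentions = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                mentions.append((candidate, dictionary[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return mentions


mentions = find_mentions("Berlin is the capital city of Germany", KEYWORD_TO_ENTITY)
```

Longest-match-first keeps "capital city" from being split into separate, more ambiguous keywords; a production system would likely use a trie or an Aho-Corasick automaton instead of repeated joins.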

15

Keyword to Wikipedia URIs in the top SERP (search engine results page)
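One way to derive such a mapping, sketched under the assumption that each keyword comes with its ranked result URLs. The rows in `serp_rows`, the `example.com` URLs, and the `top_n` cutoff are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical SERP rows: (keyword, ranked result URLs).
serp_rows = [
    ("germany", ["https://en.wikipedia.org/wiki/Germany",
                 "https://www.example.com/germany"]),
    ("tree", ["https://www.example.com/trees",
              "https://en.wikipedia.org/wiki/Tree"]),
    ("cheap flights", ["https://www.example.com/flights"]),
]


def wiki_uris(urls, top_n=10):
    """Keep only Wikipedia article URIs among the top_n results."""
    uris = []
    for url in urls[:top_n]:
        parsed = urlparse(url)
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            uris.append(parsed.path[len("/wiki/"):])
    return uris


# Keywords without any Wikipedia result are left out of the dictionary.
keyword_to_uris = {}
for keyword, urls in serp_rows:
    uris = wiki_uris(urls)
    if uris:
        keyword_to_uris[keyword] = uris
```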

16

Not all keywords are useful

Keyword Cleaning:

• Navigational or factual words

• Infrequent words

• Non-Latin letters
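The three cleaning rules could be sketched as a single predicate. The word list, the volume threshold, and the last two sample keywords (with their volumes) are hypothetical; the first three keywords and volumes are taken from the search popularity slide.

```python
import unicodedata

# Hypothetical list of navigational or factual words.
NAVIGATIONAL_OR_FACTUAL = {"login", "facts", "article", "encyclopedia", "wiki"}
MIN_VOLUME = 100  # hypothetical search-volume threshold


def is_latin(keyword):
    """True if every alphabetic character belongs to the Latin script."""
    return all(unicodedata.name(ch, "").startswith("LATIN")
               for ch in keyword if ch.isalpha())


def keep_keyword(keyword, volume):
    tokens = keyword.lower().split()
    if any(tok in NAVIGATIONAL_OR_FACTUAL for tok in tokens):
        return False  # navigational or factual word
    if volume < MIN_VOLUME:
        return False  # infrequent keyword
    return is_latin(keyword)  # drop keywords with non-Latin letters


keywords = [
    ("germany", 268583),
    ("germany facts", 4291),
    ("ger many", 16),
    ("берлин", 5000),  # hypothetical volume
    ("köln", 900),     # hypothetical volume
]
cleaned = [kw for kw, vol in keywords if keep_keyword(kw, vol)]
```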

17

Not all keywords are useful

Keyword Filtering:

• Starting or ending tokens

• Stopwords

• Part-of-speech tags

• Wikipedia popularity: popular wiki URIs for one keyword

• Search popularity: popular keywords for one wiki URI

18

Search Popularity Filtering

Keyword                 Search Popularity (Volume)
germany                 268583
germany facts           4291
germany article         24
german encyclopedia     23
germany encyclopedia    19
germany t               18
ger many                16
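A sketch of how the volumes above might be used as a filter: for the keywords mapped to one wiki URI, keep only those carrying a minimum share of the total search volume. The 1% threshold and the function name are assumptions, not values from the talk.

```python
# Search volumes for keywords mapped to the Germany entity (from the table above).
volumes = {
    "germany": 268583,
    "germany facts": 4291,
    "germany article": 24,
    "german encyclopedia": 23,
    "germany encyclopedia": 19,
    "germany t": 18,
    "ger many": 16,
}


def popular_keywords(volumes, min_share=0.01):
    """Keep keywords whose share of the total volume is at least min_share."""
    total = sum(volumes.values())
    ranked = sorted(volumes.items(), key=lambda kv: -kv[1])
    return [kw for kw, vol in ranked if vol / total >= min_share]


kept = popular_keywords(volumes)
```

With this threshold only "germany" and "germany facts" survive; the slide on keyword cleaning suggests "germany facts" would be dropped separately as a factual query.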

19

Parse Wikidata dump & extract entities as JSON

Entity data

{
  entity: "Berlin",
  Freebase Id: "/m/0156q",
  OpenStreetMap Relation identifier: 62422,
  alias: ["Berlin, Germany"],
  capital of: ["Germany", "Kingdom of Prussia", "Weimar Republic", "Brandenburg-Prussia", "Free State of Prussia", ...],
  contains administrative territorial entity: ["Mitte", "Friedrichshain-Kreuzberg", "Pankow", "Charlottenburg-Wilmersdorf", "Spandau", "Steglitz-Zehlendorf", "Tempelhof-Schöneberg", "Neukölln", "Treptow-Köpenick", ...],
  coordinate location: [
    { altitude: null, latitude: 52.516666666667, longitude: 13.383333333333, precision: 0.016666666666667 }
  ],
  country: "Germany",
  ...
}
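A rough sketch of the parsing step, assuming the standard Wikidata JSON dump layout (one JSON array, one entity per line, trailing commas). The sample lines are a hand-written stand-in for the real dump, and `iter_entities` and `simplify` are hypothetical helper names. Unlike the record on the slide, this sketch keeps property and item IDs (for example P17, Q183) and does not resolve them to labels.

```python
import json

# Hand-written stand-in for a few lines of the Wikidata JSON dump.
SAMPLE_DUMP_LINES = [
    "[",
    json.dumps({
        "id": "Q64",
        "type": "item",
        "labels": {"en": {"language": "en", "value": "Berlin"}},
        "aliases": {"en": [{"language": "en", "value": "Berlin, Germany"}]},
        "claims": {
            "P17": [{"mainsnak": {"snaktype": "value", "property": "P17",
                                  "datavalue": {"type": "wikibase-entityid",
                                                "value": {"entity-type": "item",
                                                          "id": "Q183"}}}}],
            "P646": [{"mainsnak": {"snaktype": "value", "property": "P646",
                                   "datavalue": {"type": "string",
                                                 "value": "/m/0156q"}}}],
        },
    }) + ",",
    "]",
]


def iter_entities(lines):
    """Yield one entity dict per dump line, skipping the array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        yield json.loads(line)


def simplify(entity, lang="en"):
    """Flatten an entity to label, aliases, and raw claim values."""
    claims = {}
    for prop, statements in entity.get("claims", {}).items():
        values = []
        for statement in statements:
            datavalue = statement.get("mainsnak", {}).get("datavalue")
            if datavalue is None:  # "novalue" / "somevalue" snaks
                continue
            value = datavalue["value"]
            # Item references are dicts with an "id"; strings stay as they are.
            values.append(value["id"] if isinstance(value, dict) and "id" in value else value)
        claims[prop] = values
    return {
        "id": entity["id"],
        "entity": entity.get("labels", {}).get(lang, {}).get("value"),
        "alias": [a["value"] for a in entity.get("aliases", {}).get(lang, [])],
        "claims": claims,
    }


entities = [simplify(e) for e in iter_entities(SAMPLE_DUMP_LINES)]
```

Reading line by line avoids loading the whole dump into memory; on the real compressed file the same generator could be fed from a decompressing file object.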

20

Link to the right Wikipedia entity

Word Sense Disambiguation

21

Link to Most Common Entities

Surface text: tree

Entity                   Wikipedia Commonness
Tree                     92.82%
Tree (graph theory)      2.94%
Tree (data structure)    2.57%
Tree (set theory)        0.15%
Phylogenetic tree        0.07%
Christmas tree           0.07%
Binary tree              0.04%
Family tree              0.04%
...                      ...

$$\mathrm{Commonness}(e, w) = \frac{|L_{e,w}|}{\sum_{e'} |L_{e',w}|}$$

where $|L_{e,w}|$ is the number of links with surface text $w$ to entity $e$ (Milne and Witten 2008b).
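Commonness can be computed directly from link counts: the number of links from a surface text to one entity, divided by the number of links from that surface text to any entity. The counts below are hypothetical round numbers, not the Wikipedia figures behind the percentages on the slide.

```python
from collections import defaultdict

# Hypothetical counts: (surface text, entity) -> number of links.
link_counts = {
    ("tree", "Tree"): 900,
    ("tree", "Tree_(graph_theory)"): 60,
    ("tree", "Tree_(data_structure)"): 40,
}


def commonness(link_counts):
    """Return P(entity | surface text) for every observed pair."""
    totals = defaultdict(int)
    for (surface, _entity), count in link_counts.items():
        totals[surface] += count
    return {(surface, entity): count / totals[surface]
            for (surface, entity), count in link_counts.items()}


scores = commonness(link_counts)

# Baseline linker: always choose the most common entity for a surface text.
candidates = [entity for (surface, entity) in scores if surface == "tree"]
most_common = max(candidates, key=lambda entity: scores[("tree", entity)])
```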

22

https://en.wikipedia.org/wiki/Tree_data_structure

https://en.wikipedia.org/wiki/Tree

Disambiguation

23

Disambiguation using context

24

Entity Disambiguation

• Build a Word2Vec model for Wikipedia entities

• Calculate Word2Vec similarity to contextual entities

similarity(Tree_data_structure, context) vs. similarity(Tree, context)
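A toy illustration of context-based disambiguation. The three-dimensional vectors stand in for trained Word2Vec entity embeddings and are invented for this example; averaging the cosine similarity over the context entities is also an assumption, since the slide does not say how the context is aggregated.

```python
import math

# Hypothetical vectors standing in for trained Word2Vec entity embeddings.
vectors = {
    "Tree": [0.9, 0.1, 0.0],
    "Tree_data_structure": [0.1, 0.9, 0.2],
    "Binary_search": [0.0, 0.8, 0.3],
    "Linked_list": [0.1, 0.7, 0.4],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def context_similarity(candidate, context):
    """Average cosine similarity between a candidate and the context entities."""
    return sum(cosine(vectors[candidate], vectors[c]) for c in context) / len(context)


def disambiguate(candidates, context):
    """Pick the candidate entity most similar to the context."""
    return max(candidates, key=lambda c: context_similarity(c, context))


context = ["Binary_search", "Linked_list"]
chosen = disambiguate(["Tree", "Tree_data_structure"], context)
```

In a computer-science context the data-structure sense wins even though the plant sense has far higher commonness, which is the point of using context.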

25

Relatedness between Entities

26

Image from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links

Entity Relatedness

27

Relatedness Score

• Jaccard similarity

$$J(e_1, e_2) = \frac{|\text{Intersection of links to entity } e_1 \text{ and entity } e_2|}{|\text{Union of links to entity } e_1 \text{ and entity } e_2|}$$

• Word2Vec similarity of entity to context
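The Jaccard part of the score is a few lines of set arithmetic over the articles that link to each entity. The Berlin in-links are taken from the linking-vectors slide; the second entity and its in-links are hypothetical.

```python
# Articles linking to each entity (Potsdam's set is hypothetical).
inlinks = {
    "Berlin": {"Germany", "Prussia", "Berlin_Wall", "Albert_Einstein"},
    "Potsdam": {"Germany", "Prussia", "Brandenburg"},
}


def jaccard(links_a, links_b):
    """Size of the intersection divided by the size of the union."""
    union = links_a | links_b
    if not union:
        return 0.0
    return len(links_a & links_b) / len(union)


relatedness = jaccard(inlinks["Berlin"], inlinks["Potsdam"])
```

Here two shared in-links out of five distinct ones give a score of 0.4.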

28

Wikipedia Data Parsing

29

Wikipedia Dump

'''Berlin''' is the [[Capital city|capital]] of [[Germany]] and one of its 16 [[states of Germany|states]]. With a population of approximately 3.5 million people,<ref name="Population" /> Berlin is the second [[Largest cities of the European Union by population within city limits|most populous city proper]] and the seventh [[Largest urban areas of the European Union|most populous urban area]] in the [[European Union]].
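A simplified sketch of pulling link annotations out of wikitext like the excerpt above. The regular expression only handles plain `[[target]]` and `[[target|surface]]` links; templates, files, and nested markup are ignored, and the deck's Resources slide points to Annotated-WikiExtractor for this job. The function name is hypothetical.

```python
import re

# Matches [[target]] and [[target|surface text]].
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(wikitext):
    """Return (surface text, wiki URI) pairs for each internal link."""
    links = []
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        if not target:
            continue
        surface = (match.group(2) or target).strip()
        # Article titles start with an uppercase letter and use underscores in URIs.
        uri = (target[0].upper() + target[1:]).replace(" ", "_")
        links.append((surface, uri))
    return links


wikitext = ("'''Berlin''' is the [[Capital city|capital]] of [[Germany]] "
            "and one of its 16 [[states of Germany|states]].")
links = extract_links(wikitext)
```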

30

Wikipedia Article as JSON

31

Word2Vec Training

• Collection of plain article text

... ...can4linux ||open_source|| ||controller_area_network|| ||linux_kernel|| ||device_driver|| development started 1990s philips 82c200 controller stand chip 1995 version created bus linux laboratory automation project linux lab project ||freie_universität_berlin|| nxp sja1000 successor supported controller philips 82c200 intel 82527 development powerful ||microcontroller||s integrated controllers capable ... ...
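The sample above suggests that links are replaced by single `||entity||` tokens and that the remaining text is lowercased with stopwords removed. A sketch of such preprocessing follows; the stopword list, the regular expression, and the function name are assumptions, and the actual pipeline may differ.

```python
import re

LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]+)?\]\]")
STOPWORDS = {"is", "the", "of", "and", "a", "an", "in", "one", "its"}  # hypothetical


def to_training_tokens(wikitext):
    """Turn wikitext into tokens where each link becomes one ||entity|| token."""
    def entity_token(match):
        name = match.group(1).strip().lower().replace(" ", "_")
        return " ||" + name + "|| "

    text = LINK.sub(entity_token, wikitext).replace("'''", " ")
    tokens = []
    for tok in text.split():
        if tok.startswith("||") and tok.endswith("||"):
            tokens.append(tok)  # keep entity tokens untouched
        else:
            word = re.sub(r"[^\w]", "", tok).lower()
            if word and word not in STOPWORDS:
                tokens.append(word)
    return tokens


tokens = to_training_tokens("'''Berlin''' is the [[Capital city|capital]] of [[Germany]].")
```

Token lists of this form, one per article or sentence, are the kind of input that gensim's Word2Vec (listed under Resources) accepts for training, so entities and ordinary words end up in the same vector space.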

32

Linking vectors

• Pairs of URI, annotations

outlink vector [Capital_City, Germany, States_of_Germany, European_Union, Spree, Havel, Berlin-Brandenburg_Metropolitan_Region, ...]

inlink vector [Germany, Prussia, Berlin_Wall, Albert_Einstein, Kosmos_(Berlin), Berlin_International_Film_Festival, ...]
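Both vectors can be built in one pass over (source article, target article) pairs. The pairs below reuse names from the vectors above; the function name and the choice to keep first-seen order without duplicates are assumptions.

```python
from collections import defaultdict

# (source article, target article) pairs, using names from the slide.
link_pairs = [
    ("Berlin", "Germany"),
    ("Berlin", "European_Union"),
    ("Germany", "Berlin"),
    ("Berlin_Wall", "Berlin"),
    ("Albert_Einstein", "Berlin"),
    ("Berlin", "Germany"),  # repeated link, stored only once
]


def build_link_vectors(pairs):
    """Return (outlinks, inlinks) dictionaries keyed by article URI."""
    outlinks, inlinks = defaultdict(list), defaultdict(list)
    for source, target in pairs:
        if target not in outlinks[source]:
            outlinks[source].append(target)
        if source not in inlinks[target]:
            inlinks[target].append(source)
    return outlinks, inlinks


outlinks, inlinks = build_link_vectors(link_pairs)
```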

33

Wikipedia Popularity

• Aggregation of annotations

Surface text     Wiki entity      Popularity
United States    United_States    174338
World War II     World_War_II     106483
India            India            95966
France           France           94666
American         United_States    85976
Iran             Iran             83249
Australia        Australia        76655
Germany          Germany          76384
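The aggregation itself amounts to counting how often each (surface text, wiki entity) pair occurs across all extracted annotations. The short annotation list below is hypothetical and only illustrates the counting step.

```python
from collections import Counter

# Hypothetical annotations extracted from articles: (surface text, wiki entity).
annotations = [
    ("United States", "United_States"),
    ("American", "United_States"),
    ("United States", "United_States"),
    ("Germany", "Germany"),
]

# Popularity = number of times a surface text links to an entity.
popularity = Counter(annotations)
top_pair, top_count = popularity.most_common(1)[0]
```

Note that several surface texts ("United States", "American") can point to the same entity, which is why the table is keyed by the pair and not by the entity alone.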

34

Overall System

[Architecture diagram. Recoverable component labels: Keyword Database, Keyword Processing, Parser, User Content, Keyword Matching, Disambiguation, Relatedness calculation, Result, Wikipedia Popularity, Entity Linking API, Wiki Parser, W2V Model, Wiki Links, Keyword to KB entities.]

35

Resources

• https://github.com/piskvorky/gensim

• https://github.com/jodaiber/Annotated-WikiExtractor

• https://dumps.wikimedia.org/

• https://dumps.wikimedia.org/wikidatawiki/entities/

36

Thank you

37

Questions?

f.xu@searchmetrics.com

We are hiring
