© Searchmetrics. All rights reserved. Do not distribute without permission.
Enriching content with Knowledge Base by Search Keywords and Wikidata
Fang [email protected] @allxufang
Data Science@Searchmetrics
Data-driven search and content optimization marketing
• Learning from keywords
• Content optimization
• Data visualization
Looooots of Data
• 120 Million Domains
• 600 Million Keywords
• 120 Billion Links
• 25,000 Billion Social Signals
• 25 PB raw data
Content Production in Real-time
Authors submit content
✓ Rate the content’s effectiveness
✓ Feedback to optimize and enrich it
Beyond keywords
• Keyword
  • Typos
  • Ambiguous
  • Sparse
• Entity
  • Augmented with metadata
  • Relations among entities
Entity
Q64 (the Wikidata item ID for Berlin)
Knowledge Base (KB)
Image from http://brendangriffen.com/blog/gow-programming-languages
KB Timeline
[Figure: timeline of knowledge bases. Years shown: 2001, 2005, 2008, 2012, 2012, 2014. Surviving label: Knowledge vaults.]
Why Wikidata
• Free collaborative KB
• Continuous evolution
• Open multilingual data
• Mapping to other KBs
Link content to KB
• Entity Linking -- free text to entities
  • Blog posts
  • Tweets
  • Keywords
  • User-generated content
• Entities from a knowledge base
  • Wikipedia
  • Wikidata
  • Domain-specific KBs
Entity Linking
Image from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM 2008
Main Problems
• Identify important keywords to link in the text
• Link each keyword to the right entity
Dictionary of keywords to KB entities
Search keyword mentions in text
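Searching a text for dictionary keywords can be sketched as a longest-match scan over token n-grams. This is only an illustration, not the system described in the talk: the `KEYWORD_TO_ENTITY` entries, the `find_mentions` name, the `max_len` limit, and the whitespace tokenizer are all assumptions.

```python
# Hypothetical keyword -> KB entity dictionary (illustrative entries only).
KEYWORD_TO_ENTITY = {
    "berlin": "Berlin",
    "germany": "Germany",
    "european union": "European_Union",
}


def find_mentions(text, dictionary, max_len=3):
    """Return (token_index, keyword, entity) for the longest dictionary matches."""
    tokens = text.lower().split()
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest n-gram first so "european union" wins over "european".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                match = (i, candidate, dictionary[candidate])
                break
        if match:
            mentions.append(match)
            i += len(match[1].split())  # skip past the matched tokens
        else:
            i += 1
    return mentions


mentions = find_mentions(
    "Berlin is the capital of Germany in the European Union",
    KEYWORD_TO_ENTITY,
)
```

A production matcher would more likely use a trie or an Aho-Corasick automaton, since a keyword dictionary with hundreds of millions of entries makes per-position n-gram lookups expensive.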
Keyword to Wikipedia URIs in the top SERP results
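One way to derive keyword-to-entity pairs from search results is to keep the Wikipedia pages that rank near the top for a keyword. A minimal sketch, assuming the ranked result URLs are already available as a list; `wiki_uris_in_serp`, the `top_k` cutoff, and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def wiki_uris_in_serp(ranked_urls, top_k=10):
    """Return Wikipedia article names found among the top_k ranked result URLs."""
    uris = []
    for url in ranked_urls[:top_k]:
        parsed = urlparse(url)
        is_wikipedia = parsed.netloc.endswith("wikipedia.org")
        if is_wikipedia and parsed.path.startswith("/wiki/"):
            uris.append(parsed.path[len("/wiki/"):])
    return uris


# Illustrative ranked results for the keyword "germany" (not real SERP data).
serp = [
    "https://example.com/germany",
    "https://en.wikipedia.org/wiki/Germany",
    "https://en.wikipedia.org/wiki/Tree",
]
uris = wiki_uris_in_serp(serp, top_k=2)
```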
Not all keywords are useful
Keyword Cleaning:
• Navigational or factual words
• Infrequent words
• Non-Latin letters
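The three cleaning rules could be combined into a single predicate. This is a sketch under stated assumptions: the marker-word list, the `min_volume` threshold, and the reading of "non-frequent" as low search volume are illustrative guesses, and the ASCII-only pattern also rejects accented Latin letters, which a real filter would probably keep.

```python
import re

# Hypothetical marker words; a real list would be curated from search data.
NAVIGATIONAL_OR_FACTUAL = {"login", "www", "facts", "definition"}
# Simplification: plain ASCII letters, digits, spaces, hyphens, apostrophes.
LATIN_ONLY = re.compile(r"^[a-z0-9 '\-]+$")


def is_clean(keyword, volume, min_volume=100):
    """Return True if the keyword passes all three cleaning rules."""
    kw = keyword.lower().strip()
    if not LATIN_ONLY.match(kw):
        return False  # contains non-Latin letters
    if volume < min_volume:
        return False  # searched too rarely
    if any(token in NAVIGATIONAL_OR_FACTUAL for token in kw.split()):
        return False  # navigational or factual marker word
    return True
```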
Not all keywords are useful
Keyword Filtering:
• Starting or ending tokens
  • Stopwords
  • Part-of-speech tags
• Wikipedia popularity:
  • popular wiki URIs for one keyword
• Search popularity:
  • popular keywords for one wiki URI
Search Popularity Filtering

Keyword                 Search Popularity (Volume)
germany                 268583
germany facts           4291
germany article         24
german encyclopedia     23
germany encyclopedia    19
germany t               18
ger many                16
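A simple way to filter by search popularity is to keep only keywords whose volume is a meaningful fraction of the most popular keyword for the same wiki URI. The sketch below uses the volumes listed on this slide and assumes they all map to one URI; the `min_ratio` cutoff and the function name are assumptions, since the talk does not state the actual threshold.

```python
def filter_by_search_popularity(keyword_volumes, min_ratio=0.01):
    """Keep keywords whose volume is at least min_ratio of the top volume."""
    if not keyword_volumes:
        return []
    top = max(keyword_volumes.values())
    return [kw for kw, vol in keyword_volumes.items() if vol >= min_ratio * top]


# Search volumes from the slide, assumed to map to the same wiki URI.
germany_keywords = {
    "germany": 268583,
    "germany facts": 4291,
    "germany article": 24,
    "german encyclopedia": 23,
    "germany encyclopedia": 19,
    "germany t": 18,
    "ger many": 16,
}
kept = filter_by_search_popularity(germany_keywords)
```

With a 1% cutoff only "germany" and "germany facts" survive; the long tail of typos and fragments is dropped.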
Parse the Wikidata dump & extract entities as JSON
Entity data
{
  entity: "Berlin",
  Freebase Id: "/m/0156q",
  OpenStreetMap Relation identifier: 62422,
  alias: ["Berlin, Germany"],
  capital of: [
    "Germany",
    "Kingdom of Prussia",
    "Weimar Republic",
    "Brandenburg-Prussia",
    "Free State of Prussia",
    ...
  ],
  contains administrative territorial entity: [
    "Mitte",
    "Friedrichshain-Kreuzberg",
    "Pankow",
    "Charlottenburg-Wilmersdorf",
    "Spandau",
    "Steglitz-Zehlendorf",
    "Tempelhof-Schöneberg",
    "Neukölln",
    "Treptow-Köpenick",
    ...
  ],
  coordinate location: [
    {
      altitude: null,
      latitude: 52.516666666667,
      longitude: 13.383333333333,
      precision: 0.016666666666667
    }
  ],
  country: "Germany",
  ... ...
}
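The Wikidata JSON dump is one large JSON array with one entity per line, so it can be read line by line without loading the whole file. The sketch below parses such a line and flattens it into a small record; the `parse_dump_line` and `simplify` names and the output layout are assumptions, and claims stay keyed by property ID (for example P646, the Freebase ID), so turning them into readable names like "capital of" would need a separate property-label lookup.

```python
import json


def parse_dump_line(line):
    """Parse one dump line; return None for the enclosing array brackets."""
    line = line.strip().rstrip(",")
    if line in ("[", "]", ""):
        return None
    return json.loads(line)


def simplify(entity, lang="en"):
    """Flatten a Wikidata entity into label, aliases, and raw claim values."""
    label = entity.get("labels", {}).get(lang, {}).get("value")
    aliases = [a["value"] for a in entity.get("aliases", {}).get(lang, [])]
    claims = {}
    for prop, statements in entity.get("claims", {}).items():
        values = []
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # "novalue" and "somevalue" snaks carry no datavalue.
            if snak.get("snaktype") == "value":
                values.append(snak["datavalue"]["value"])
        claims[prop] = values
    return {"id": entity["id"], "entity": label, "alias": aliases, "claims": claims}


# A heavily trimmed, hand-built line in the dump's layout (illustrative only).
sample_line = json.dumps({
    "id": "Q64",
    "labels": {"en": {"language": "en", "value": "Berlin"}},
    "aliases": {"en": [{"language": "en", "value": "Berlin, Germany"}]},
    "claims": {"P646": [{"mainsnak": {
        "snaktype": "value",
        "property": "P646",
        "datavalue": {"value": "/m/0156q", "type": "string"},
    }}]},
}) + ","

berlin = simplify(parse_dump_line(sample_line))
```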
Link to the right Wikipedia entity
Word Sense Disambiguation
Link to Most Common Entities
Commonness (Milne and Witten 2008b) for the surface text "tree":

Entity                   Wikipedia Commonness
Tree                     92.82%
Tree (graph theory)      2.94%
Tree (data structure)    2.57%
Tree (set theory)        0.15%
Phylogenetic tree        0.07%
Christmas tree           0.07%
Binary tree              0.04%
Family tree              0.04%
…                        ...

$$\mathrm{Commonness}(e, w) = \frac{L_{w,e}}{\sum_{i} L_{w,e_i}}$$
where $L_{w,e}$ is the number of links with surface text $w$ and entity $e$.
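Commonness is the share of links with a given surface text that point to a given entity. A minimal sketch of that ratio follows; the link counts are made-up round numbers for illustration, not the Wikipedia counts behind the percentages on the slide, and the function name is an assumption.

```python
def commonness(link_counts, surface):
    """Return {entity: share of links with this surface text pointing to it}."""
    counts = link_counts.get(surface, {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {entity: count / total for entity, count in counts.items()}


# Hypothetical counts: surface text -> {entity: number of links}.
LINK_COUNTS = {
    "tree": {
        "Tree": 900,
        "Tree_(graph_theory)": 60,
        "Tree_(data_structure)": 40,
    }
}

scores = commonness(LINK_COUNTS, "tree")
most_common = max(scores, key=scores.get)
```

Always picking `most_common` is the baseline that context-based disambiguation has to beat.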
Disambiguation
https://en.wikipedia.org/wiki/Tree_data_structure
https://en.wikipedia.org/wiki/Tree
Disambiguation using context
Entity Disambiguation
• Build a Word2Vec model for Wikipedia entities
• Calculate Word2Vec similarity to contextual entities
similarity(Tree_data_structure, context) vs. similarity(Tree, context)
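Context-based disambiguation can be sketched as picking the candidate whose vector is closest to the entities already found in the surrounding text. The two-dimensional vectors below are toy values standing in for embeddings from a trained Word2Vec model; averaging the cosine similarities over the context and the `disambiguate` name are assumptions, not necessarily what the system in the talk does.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(candidates, context_entities, vectors):
    """Return the candidate with the highest mean similarity to the context."""
    def score(candidate):
        sims = [
            cosine(vectors[candidate], vectors[ctx])
            for ctx in context_entities
            if ctx in vectors
        ]
        return sum(sims) / len(sims) if sims else 0.0

    return max(candidates, key=score)


# Toy entity vectors (hypothetical; real ones come from the trained model).
VECTORS = {
    "Tree": [0.9, 0.1],
    "Tree_data_structure": [0.1, 0.9],
    "Binary_search": [0.2, 0.8],
    "Forest": [0.8, 0.2],
}

in_code_text = disambiguate(["Tree", "Tree_data_structure"], ["Binary_search"], VECTORS)
in_nature_text = disambiguate(["Tree", "Tree_data_structure"], ["Forest"], VECTORS)
```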
Relatedness between Entities
Image from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links
Entity Relatedness
Relatedness Score
• Jaccard similarity
• Word2Vec similarity of entity to context

$$\mathrm{Jaccard}(e_1, e_2) = \frac{|\text{Intersection of links to entity } e_1 \text{ and } e_2|}{|\text{Union of links to entity } e_1 \text{ and } e_2|}$$
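The Jaccard score is the size of the intersection of two entities' link sets divided by the size of their union. A short sketch, with hypothetical incoming-link sets chosen only to make the arithmetic easy to follow:

```python
def jaccard(links_a, links_b):
    """Jaccard similarity of two collections of linking pages."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical pages that link to each entity.
links_to_berlin = {"Germany", "Prussia", "Berlin_Wall", "Albert_Einstein"}
links_to_germany = {"Prussia", "Berlin_Wall", "European_Union"}

# 2 shared pages out of 5 distinct pages in total.
relatedness = jaccard(links_to_berlin, links_to_germany)
```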
Wikipedia Data Parsing
Wikipedia Dump
'''Berlin''' is the [[Capital city|capital]] of [[Germany]] and one of its 16 [[states of Germany|states]]. With a population of approximately 3.5 million people,<ref name="Population" /> Berlin is the second [[Largest cities of the European Union by population within city limits|most populous city proper]] and the seventh [[Largest urban areas of the European Union|most populous urban area]] in the [[European Union]].
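Links in the dump use the `[[target|anchor]]` syntax, where the anchor is optional. A regular expression is enough to pull out (target, anchor) pairs from simple text like the sample here. This is a rough sketch: it ignores templates, nested markup, and namespace links such as files or categories, and the `extract_links` name is an assumption.

```python
import re

# [[target]] or [[target|anchor]]; no nested brackets.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(wikitext):
    """Return (target, anchor) pairs for every wiki link in the text."""
    links = []
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without an explicit anchor, the target doubles as the surface text.
        anchor = (match.group(2) or target).strip()
        links.append((target, anchor))
    return links


links = extract_links(
    "'''Berlin''' is the [[Capital city|capital]] of [[Germany]] "
    "and one of its 16 [[states of Germany|states]]."
)
```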
Wikipedia Article as JSON
Word2Vec Training
• Collection of plain article text
... ...can4linux ||open_source|| ||controller_area_network|| ||linux_kernel|| ||device_driver|| development started 1990s philips 82c200 controller stand chip 1995 version created bus linux laboratory automation project linux lab project ||freie_universität_berlin|| nxp sja1000 successor supported controller philips 82c200 intel 82527 development powerful ||microcontroller||s integrated controllers capable ... ...
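Training text of this shape can be produced by replacing each wiki link with a single `||entity||` token and lowercasing and filtering the remaining words. The sketch below shows one way to do that; the tiny stopword list, the punctuation handling, and the `to_training_tokens` name are assumptions rather than the preprocessing actually used for the sample.

```python
import re

LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]+)?\]\]")
# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"is", "the", "of", "and", "a", "in", "its", "one"}


def to_training_tokens(wikitext):
    """Turn wikitext into lowercase tokens with links as ||entity|| tokens."""
    def entity_token(match):
        name = match.group(1).strip().lower().replace(" ", "_")
        return " ||" + name + "|| "

    text = LINK.sub(entity_token, wikitext)
    tokens = []
    for token in text.split():
        if token.startswith("||") and token.endswith("||"):
            tokens.append(token)  # keep entity tokens untouched
            continue
        word = re.sub(r"[^a-z0-9]", "", token.lower())
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens


tokens = to_training_tokens("'''Berlin''' is the [[Capital city|capital]] of [[Germany]].")
```

Token lists like this, one per article or sentence, are the kind of input that gensim's `Word2Vec` accepts as its training sentences, which puts words and entities in the same vector space.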
Linking vectors
• Pairs of URI, annotations
outlink vector [Capital_City, Germany , States_of_Germany, European_Union,Spree, Havel, Berlin-Brandenburg_Metropolitan_Region, ... ... ]
inlink vector [Germany, Prussia, Berlin_Wall, Albert_Einstein, Kosmos_(Berlin), Berlin_International_Film_Festival, .. .. ]
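Outlink and inlink vectors for every article can be built in one pass over (source, target) link pairs. A minimal sketch, assuming the pairs have already been extracted from the parsed articles; the sample pairs and the `build_link_vectors` name are illustrative.

```python
from collections import defaultdict


def build_link_vectors(pairs):
    """Return (outlinks, inlinks) dicts mapping each URI to a set of URIs."""
    outlinks = defaultdict(set)
    inlinks = defaultdict(set)
    for source, target in pairs:
        outlinks[source].add(target)  # pages that source links to
        inlinks[target].add(source)   # pages that link to target
    return outlinks, inlinks


# Illustrative (source article, linked article) pairs.
PAIRS = [
    ("Berlin", "Germany"),
    ("Berlin", "European_Union"),
    ("Germany", "Berlin"),
    ("Berlin_Wall", "Berlin"),
]
outlinks, inlinks = build_link_vectors(PAIRS)
```

Sets are used rather than lists so that repeated links count once, which is what a set-based relatedness measure such as Jaccard similarity expects.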
Wikipedia Popularity
• Aggregation of annotations

Surface text     Wiki entity      Popularity
United States    United_States    174338
World War II     World_War_II     106483
India            India            95966
France           France           94666
American         United_States    85976
Iran             Iran             83249
Australia        Australia        76655
Germany          Germany          76384
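Aggregating annotations amounts to counting how often each (surface text, entity) pair occurs across all articles. A short sketch using `collections.Counter`; the handful of annotations is illustrative and the function name is an assumption.

```python
from collections import Counter


def aggregate_popularity(annotations):
    """Count occurrences of each (surface text, wiki entity) pair."""
    return Counter((surface, entity) for surface, entity in annotations)


# Illustrative annotations as (surface text, wiki entity) pairs.
ANNOTATIONS = [
    ("United States", "United_States"),
    ("American", "United_States"),
    ("United States", "United_States"),
    ("Germany", "Germany"),
]
popularity = aggregate_popularity(ANNOTATIONS)
top_pair, top_count = popularity.most_common(1)[0]
```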
Overall System
[Architecture diagram. Components: Keyword Database, Keyword Processing, Keyword to KB entities, Parser, User Content, Keyword Matching, Disambiguation, Relatedness calculation, Result, Wikipedia Popularity, Entity Linking API, Wiki Parser, W2V Model, Wiki Links]
Resources
• https://github.com/piskvorky/gensim
• https://github.com/jodaiber/Annotated-WikiExtractor
• https://dumps.wikimedia.org/
• https://dumps.wikimedia.org/wikidatawiki/entities/
Thank you
Questions?
We are hiring