But what do we actually know? On knowledge base recall
Simon Razniewski, Free University of Bozen-Bolzano, Italy


TRANSCRIPT

Page 1

But what do we actually know?

On knowledge base recall

Simon Razniewski, Free University of Bozen-Bolzano, Italy

Page 2

Background
• TU Dresden, Germany: Diplom (Master) 2010
• Free University of Bozen-Bolzano, Italy
  - 2011 - 2014: PhD (on reasoning about data completeness)
  - 2014 - now: fixed-term assistant professor
  - Research visits at UCSD (2012), AT&T Labs-Research (2013), UQ (2015)

Bolzano: trilingual, home of Ötzi, 1/8th of EU apples

Page 3

How complete are knowledge bases? = recall

Page 4

KBs are pretty incomplete
• DBpedia: contains 6 out of 35 Dijkstra Prize winners
• YAGO: the average number of children per person is 0.02
• Google Knowledge Graph: "Points of Interest" – completeness?

Page 5

KBs are pretty complete
• Wikidata: 2 out of 2 children of Obama
• Google Knowledge Graph: 36 out of 48 Tarantino movies
• DBpedia: 167 out of 199 Nobel laureates in Physics
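The counts on the last two slides are instances of the usual recall formula: facts in the KB divided by facts that hold in reality. A minimal sketch, taking the stated totals as ground truth:

```python
# Recall of a KB for a relation: facts in the KB divided by facts that hold in reality.
def recall(in_kb: int, in_reality: int) -> float:
    return in_kb / in_reality

# Counts taken from the slides above.
print(recall(6, 35))     # DBpedia: Dijkstra Prize winners     -> ~0.17
print(recall(36, 48))    # Google KG: Tarantino movies         -> 0.75
print(recall(167, 199))  # DBpedia: Nobel laureates in Physics -> ~0.84
```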

Page 6

So, how complete are KBs?

Page 7

[Dong et al., KDD 2014]: KB engineers have only tried to make KBs bigger. The point, however, is to understand what they are actually trying to approximate.

"There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know." [Rumsfeld, 2002]

Page 8

Knowledge Bases as seen by [Rumsfeld, 2002]

Known knowns: the plain facts in a KB
• Trump's birth date
• Hillary's nationality
• …

Known unknowns: the easy stuff
• NULL values/blank nodes
• Missing functional/mandatory values

Unknown unknowns: the interesting rest
• Are all children of John there?
• Does Mary play a musical instrument?
• Does Bob have other nationalities?
• …

Not KB completion! (That would ask:)
• What other children does John have?
• Which instruments does Mary play?
• Which nationalities does Bob have?

Page 9

Outline
1. Assessing completeness from inside the KB
   a) Rule mining
   b) Classification
2. Assessing completeness using text
   c) Cardinalities
   d) Recall-aware information extraction
3. Presenting the completeness of KBs
4. The meaning of it all
   e) When is an entity complete?
   f) When is an entity more complete than another?
   g) Are interesting facts complete?

Page 10

1. Assessing completeness from inside the KB

Page 11

1a) Rule Mining [Galarraga et al., WSDM 2017]

hockeyPlayer(x) ⇒ Incomplete(x, hasChild)
scientist(x), hasWonNobelPrize(x) ⇒ Complete(x, graduatedFrom)

Challenge: no proper theory for consensus across multiple rules
human(x) ⇒ Complete(x, graduatedFrom)
teacher(x) ⇒ Incomplete(x, graduatedFrom)
professor(x) ⇒ Complete(x, graduatedFrom)
John is a human, a teacher, and a professor – Complete(John, graduatedFrom)?

Maybe the wrong approach?
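To make the consensus problem concrete, here is a minimal sketch (not the WSDM 2017 system): each mined rule that fires on an entity's classes votes Complete or Incomplete, and some aggregation policy, here an assumed confidence-weighted vote with made-up confidence values, has to break the tie for John.

```python
# Illustrative only: mined completeness rules vote on (entity, predicate) pairs,
# and a simple confidence-weighted vote stands in for a proper consensus theory.
RULES = [
    # (required classes, predicate, verdict, confidence) -- confidences are made up
    ({"human"},     "graduatedFrom", "Complete",   0.55),
    ({"teacher"},   "graduatedFrom", "Incomplete", 0.70),
    ({"professor"}, "graduatedFrom", "Complete",   0.80),
]

def predict(entity_classes, predicate):
    """Aggregate all firing rules by summed confidence; None if no rule fires."""
    scores = {"Complete": 0.0, "Incomplete": 0.0}
    for classes, pred, verdict, conf in RULES:
        if pred == predicate and classes <= entity_classes:
            scores[verdict] += conf
    if scores["Complete"] == scores["Incomplete"] == 0.0:
        return None
    return max(scores, key=scores.get)

# John is a human, a teacher, and a professor: the rules disagree.
print(predict({"human", "teacher", "professor"}, "graduatedFrom"))  # 'Complete' (0.55 + 0.80 vs 0.70)
```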

Page 12

1b) A classification problem

Input: entity e, predicate p (e.g. e = Obama, p = hasChild)
Question: are all triples (e, p, _) in the KB?
Output: Yes/No (e.g. Yes, according to Wikidata, for (Obama, hasChild, _))

Features: facts, popularity measures, textual context, …
Training data: crowdsourcing under constraints, deletion, popularity, …
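A minimal sketch of the classification view, assuming a scikit-learn logistic regression over two toy features (fact count and a popularity score); the features, labels, and data below are placeholders, not the actual setup from the talk:

```python
# Sketch: completeness assessment as binary classification.
from sklearn.linear_model import LogisticRegression

def features(kb, popularity, entity, predicate):
    """Simple per-(entity, predicate) features: fact count and entity popularity."""
    n_facts = len(kb.get((entity, predicate), []))
    return [n_facts, popularity.get(entity, 0.0)]

# Toy KB and popularity scores (e.g. page views), purely illustrative.
kb = {("Obama", "hasChild"): ["Malia", "Sasha"], ("John", "hasChild"): []}
popularity = {"Obama": 0.9, "John": 0.1}

# Toy labels: 1 = all (e, p, _) triples are in the KB, 0 = something is missing.
X = [features(kb, popularity, "Obama", "hasChild"),
     features(kb, popularity, "John", "hasChild")]
y = [1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features(kb, popularity, "Obama", "hasChild")]))  # expected: [1]
```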

Page 13

2. Assessing completeness using text

Page 14

2c) Cardinality extraction [Mirza et al., Poster@ISWC 2016]

Text: "Barack and Michelle have two children, and […]"
With two children stated in the text: KB contains 0 → recall 0%; KB contains 1 → recall 50%; KB contains 2 → recall 100%.

• Manually created patterns to extract children cardinalities from Wikipedia
• Found that about 2k entities have complete children, 84k have incomplete children
• Found evidence for 178% more children than currently in Wikidata
  - especially intriguing for long-tail entities

Open: automation, other relations
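A minimal sketch of the cardinality idea (the manually created patterns behind the poster are richer than the single illustrative regex below): extract a stated children count from text and divide the KB's child count by it.

```python
import re

# Map number words to integers for a single hand-written pattern (illustrative).
NUM_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
PATTERN = re.compile(r"\bhave (\w+) children\b", re.IGNORECASE)

def extract_child_cardinality(text):
    """Return the number of children stated in the text, or None if no pattern matches."""
    m = PATTERN.search(text)
    if not m:
        return None
    word = m.group(1).lower()
    return NUM_WORDS.get(word, int(word) if word.isdigit() else None)

def child_recall(kb_count, text):
    """Estimate recall of the hasChild relation as KB count / stated cardinality."""
    stated = extract_child_cardinality(text)
    return None if not stated else min(kb_count / stated, 1.0)

text = "Barack and Michelle have two children, and ..."
print(child_recall(0, text), child_recall(1, text), child_recall(2, text))  # 0.0 0.5 1.0
```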

Page 15

2d) Recall-aware information extraction

Textual information extraction is usually precision-aware:
• "John was born in Malmö, Sweden, on […]." → citizenship(John, Sweden) – precision 95%
• "John grew up in Malmö, Sweden and […]" → citizenship(John, Sweden) – precision 80%

What about making it recall-aware?
• "John has a son, Tom, and a daughter, Susan." → hc(John, Tom), hc(John, Susan) – recall?
• "John brought his children Susan and Tom to school." → hc(John, Tom), hc(John, Susan) – recall?
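One way recall-awareness could be made operational, purely as an assumption of mine and not something proposed on the slide: attach a recall estimate to extracted hasChild facts depending on whether the sentence reads like an exhaustive enumeration or a passing mention. The cues and numbers are invented.

```python
# Assumption-laden sketch: estimate how likely a sentence is to list ALL children.
import re

# Cue -> assumed probability that the sentence enumerates all children (made-up numbers).
ENUMERATION_CUES = [
    (re.compile(r"\bhas a son\b.*\band a daughter\b", re.I), 0.9),   # "has a son ... and a daughter ..."
    (re.compile(r"\bhave (two|three|four) children\b", re.I), 0.95), # explicit cardinality
    (re.compile(r"\bbrought his children\b", re.I), 0.3),            # incidental mention
]

def recall_estimate(sentence):
    """Return an estimated recall for children extracted from this sentence."""
    for pattern, estimate in ENUMERATION_CUES:
        if pattern.search(sentence):
            return estimate
    return None  # no cue: no recall statement possible

print(recall_estimate("John has a son, Tom, and a daughter, Susan."))         # 0.9
print(recall_estimate("John brought his children Susan and Tom to school."))  # 0.3
```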

Page 16

3. Presenting the completeness of KBs

Page 17

How complete is Wikidata for children?

2.7%

Page 18

(Completeness-analysis mockup)

Select attribute to analyse: hasChild, date of birth, party membership, …

Known completeness: 2.7%
Extrapolated completeness: 30.8%

Facets:
• Occupation: politician 7.5%, soccer player 3.3%, lawyer 8.1%, other 2.2%
• Nationality: USA 3.8%, India 2.7%, China 2.2%, England 5.5%, …
• Century of birth: <15th century 1.1%, 16th century 1.4%, …
• Gender: male 4.3%, female 3.9%

Based on:
• There are 5371 people of this kind
• For these, 231 have children
• For these, Wikipedia says there should be 750 children
• Average number of children of complete entities is 2.3
• Average number of children of unknown people is 0.01
• …
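The slide does not give the exact extrapolation formula, but the 30.8% is consistent with one simple reading: compare the child statements already in the KB with the children that Wikipedia text suggests should exist. A sketch under that interpretation (my reading, not necessarily the tool's formula):

```python
# My reading of the slide's numbers, not necessarily the tool's exact formula:
# extrapolated completeness = child statements already in the KB
#                             / children that Wikipedia text says should exist.
children_in_kb = 231          # "for these, 231 have children", read as 231 child statements in the KB
children_per_wikipedia = 750  # "Wikipedia says there should be 750 children"

extrapolated_completeness = children_in_kb / children_per_wikipedia
print(f"Extrapolated completeness: {extrapolated_completeness:.1%}")  # -> 30.8%
```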

Page 19

4. The meaning of it all

Page 20

4e) When is data about an entity complete?

Complete(Justin Bieber)?
• Musician: birth date, musical instrument played, band
• Scientist: alma mater, field, advisor, awards
• Politician: party, public positions held

What about musicians playing in an orchestra? What about scientists that are also engaged in politics? …
• Interestingness is relative ("birth date more interesting than handedness")
• Long tail of rare properties

Some work on ranking predicates by relevance.
Shortcoming: mostly descriptive (see e.g. the Wikidata Property Suggestor).

Page 21

4f) When is an entity more complete than another?

Is data about Obama more complete than about Trump?

Goal: A notion of relative completeness

Is data about Ronaldo more complete than about Justin Bieber? …

Crowd studies: relative completeness = fact count?
Available as a user script for Wikidata (Recoin - Relative Completeness Indicator)
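Recoin's actual scoring is documented at the URL on the next page; as a rough illustration of relative completeness beyond a plain fact count, the sketch below compares an entity's filled properties against the properties most frequent among entities of the same class. All data is made up, and this is not Recoin's algorithm.

```python
# Rough illustration of relative completeness: how many class-typical properties
# does an entity have filled, compared with its peers?
from collections import Counter

def class_typical_properties(entities_of_class, top_k=5):
    """Properties most frequently present among entities of the same class."""
    counts = Counter(p for props in entities_of_class.values() for p in props)
    return [p for p, _ in counts.most_common(top_k)]

def relative_completeness(entity_props, typical):
    """Fraction of class-typical properties the entity has filled."""
    return sum(p in entity_props for p in typical) / len(typical)

# Made-up data: properties present for a few politicians.
politicians = {
    "Obama": {"birth date", "party", "spouse", "child", "position held"},
    "Trump": {"birth date", "party", "spouse", "position held"},
    "Smith": {"birth date", "party"},
}
typical = class_typical_properties(politicians)
for name, props in politicians.items():
    print(name, relative_completeness(props, typical))  # Obama 1.0, Trump 0.8, Smith 0.4
```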

Page 22

https://www.wikidata.org/wiki/User:Ls1g/Recoin

Page 23

4g) Are interesting facts complete?

• LIGO: proved the existence of the gravitational waves that Einstein predicted 80 years ago
• Galileo Galilei: contrary to the dogma of the time, postulated that the Earth orbits the Sun
• Reinhold Messner: first person to climb all mountains above 8000 m without supplemental oxygen

These are not elementary triples:
FirstPersonToClimbAllMountainsAbove8000Without(Supplemental oxygen, Reinhold Messner)

1. What are these? Events? Sets of triples? Queries?
2. Where can we get the interestingness score from? Entropy? PageRank? Text frequency?
3. Completeness depends on completeness of context!

Page 24

Summary
1. Assessing completeness from inside the KB
   a) Rule mining
   b) Classification
2. Assessing completeness using text
   c) Cardinalities
   d) Recall-aware information extraction
3. Presenting the completeness of KBs
4. The meaning of it all
   e) When is an entity complete?
   f) When is an entity more complete than another?
   g) Are interesting facts complete?

…meet me in room 416