TEMPORAL FACT EXTRACTION, DISAMBIGUATION, AND EVOLUTION
Arturas Mazeika and Marc Spaniol
Symposium on Bias and Diversity in IR
European Summer School of Information Retrieval
August 29 - September 2, 2011, Koblenz, Germany
APPLICATION: ENTITY TIMELINES
• Harvesting NEs
• Extracting time
• Canonicalization
• YAGO ontology
01.09.2011 2Temporal Fact extraction, Disambiguation, and Evolution, SBDIR@ESSIR 2011
OUTLINE
• Querying semantic knowledge bases
• Harvesting facts
– Wikipedia
– Web
– Temporal facts
QUERYING THE SEMANTIC WEB
WHAT’S WORKING? WHAT’S NOT?
Query: politicians who are also scientists?
?x isa politician .
?x isa scientist
Results: Benjamin Franklin, Zbigniew Brzezinski, Angela Merkel, …
http://www.mpi-inf.mpg.de/yago-naga/
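The query above is a conjunction of two triple patterns that share the variable ?x. A minimal sketch of how such a conjunctive query could be evaluated over an in-memory set of triples follows; the `TRIPLES` sample and the `match` helper are illustrative assumptions, not the YAGO/NAGA query engine.

```python
# Toy triple store; the facts are illustrative and loosely follow the slide's result list.
TRIPLES = {
    ("Benjamin_Franklin", "isa", "politician"),
    ("Benjamin_Franklin", "isa", "scientist"),
    ("Angela_Merkel", "isa", "politician"),
    ("Angela_Merkel", "isa", "scientist"),
    ("Max_Planck", "isa", "scientist"),
}

def match(patterns, triples):
    """Return all variable bindings that satisfy every (s, p, o) pattern."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = dict(binding)
                ok = True
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # A variable must agree with any earlier binding.
                        if candidate.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        bindings = extended
    return bindings

QUERY = [("?x", "isa", "politician"), ("?x", "isa", "scientist")]
answers = sorted(b["?x"] for b in match(QUERY, TRIPLES))
```

Max_Planck is dropped because only one of the two patterns can be bound to him.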
YAGO Entity
[Figure: excerpt of the YAGO knowledge graph, http://www.mpi-inf.mpg.de/yago-naga/. Entity nodes (Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Nobel Prize) are connected to class nodes (Physicist, Scientist, Politician, Person, City, State, Country, Location) by instanceOf and subclass edges, to name strings (“Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, “Angela Dorothea Merkel”) by means edges, some of them weighted (0.9, 0.1), and to other nodes by fact edges: bornOn (Apr 23, 1858), diedOn (Oct 4, 1947; Oct 23, 1944), bornIn, hasWon, FatherOf, citizenOf, locatedIn.]
Accuracy: 95% (Suchanek et al.: WWW'07, Hoffart et al.: WWW'11)
KNOWLEDGE REPRESENTATION
...
• RDF (Resource Description Framework, W3C):
subject-property-object (SPO) triples, binary relations
structure, but no (prescriptive) schema
• Relations, frames
• Description logics: OWL, DL-lite
• Higher-order logics, epistemic logics
facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)
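Facts about facts can be stored by giving every triple an identifier and letting that identifier appear as the subject of further triples. The following is a small sketch of this idea using the numbered facts from the slide; the dictionary layout and the `meta` helper are assumptions made for illustration.

```python
# Base facts keyed by identifier, numbered as on the slide.
facts = {
    1: ("JimGray", "hasAdvisor", "MikeHarrison"),
    2: ("SurajitChaudhuri", "hasAdvisor", "JeffUllman"),
    3: ("Madonna", "marriedTo", "GuyRitchie"),
    4: ("NicolasSarkozy", "marriedTo", "CarlaBruni"),
}

# Facts about facts: the subject is the identifier of another fact.
facts.update({
    5: (1, "inYear", "1968"),
    6: (2, "inYear", "2006"),
    7: (3, "validFrom", "22-Dec-2000"),
    8: (3, "validUntil", "Nov-2008"),
    9: (4, "validFrom", "2-Feb-2008"),
    10: (2, "source", "SigmodRecord"),
})

def meta(fact_id):
    """Collect all property/value pairs attached to the fact with this identifier."""
    return {p: o for (s, p, o) in facts.values() if s == fact_id}

madonna_marriage = meta(3)
```

Here `meta(3)` gathers the validity interval of the marriedTo fact, which is the kind of temporal scope the later slides aim to harvest.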
WORDNET THESAURUS [MILLER/FELLBAUM 1998]
http://wordnet.princeton.edu/
3 concepts / classes and their synonyms (synsets)
subclasses (hyponyms)
superclasses (hypernyms)
MAPPING: WIKIPEDIA → WORDNET [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]
[Figure: the Wikipedia entity Jim Gray (computer specialist) and candidate WordNet classes: Computer Scientist; American; Scientist; Sailor, Crewman; Missing Person; Chemist; Artist.]
WIKIPEDIA CATEGORIES
MAPPING: WIKIPEDIA → WORDNET [Suchanek: WWW'07]
[Figure: Jim Gray (computer specialist) is linked by instanceOf to his Wikipedia categories (American Computer Scientists, Database Researcher, Fellows of the ACM, People Lost at Sea). The subclassOf links from these categories to WordNet classes are open (marked “?”); the candidate classes are American; Computer Scientist; Data-base; Scientist; Fellow (1), Comrade; Fellow (2), Colleague; Fellow (3) (of Society); Member (1), Fellow; Member (2), Extremity; Sailor, Crewman; Missing Person.]
• name similarity (edit dist., n-gram overlap)?
• context similarity (word/phrase level)?
• machine learning?
MAPPING: WIKIPEDIA → WORDNET [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]
Analyzing category names → noun group parser:
American Musicians of Italian Descent
pre-modifier: American; head: Musicians; post-modifier: of Italian Descent
American Folk Music of the 20th Century
pre-modifier: American Folk; head: Music; post-modifier: of the 20th Century
American Indy 500 Drivers on Pole Positions
pre-modifier: American Indy 500; head: Drivers; post-modifier: on Pole Positions
Head word is key, should be in plural for instanceOf
Given: entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
MAPPING WIKIPEDIA ENTITIES TO WORDNET CLASSES
Heuristic Method:
for each ci do
if head word w of category name ci is plural {
1) match w against synsets of WordNet classes
2) choose best fitting class c and set e ∈ c
3) expand w by pre-modifier and set ci ⊆ w+ ⊆ c
}
• can also derive features this way
• feed into supervised classifier
[Suchanek: WWW'07] tuned conservatively: high precision, reduced recall
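The first step of the heuristic, finding the head word of a category name and testing whether it is plural, can be sketched as follows. This is a deliberately naive stand-in for a noun group parser: the preposition list, the `split_category` and `class_candidate` helpers and the suffix-based plural test are all assumptions made for illustration, and names such as “People Lost at Sea” would need a real parser.

```python
# Small, hypothetical preposition list that marks the start of the post-modifier.
PREPOSITIONS = {"of", "in", "on", "at", "by", "from", "for"}

def split_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    tokens = name.split()
    cut = next((i for i, t in enumerate(tokens) if t.lower() in PREPOSITIONS),
               len(tokens))
    cut = max(cut, 1)  # keep at least one token for the head
    pre, head, post = tokens[:cut - 1], tokens[cut - 1], tokens[cut:]
    return " ".join(pre), head, " ".join(post)

def is_plural(word):
    # Crude suffix test standing in for a morphological analysis.
    lowered = word.lower()
    return lowered.endswith("s") and not lowered.endswith("ss")

def class_candidate(name):
    """Return the head word if it is plural, i.e. usable for instanceOf."""
    _, head, _ = split_category(name)
    return head if is_plural(head) else None

parts = split_category("American Musicians of Italian Descent")
```

The returned head word would then be matched against WordNet synsets (steps 1 to 3 of the heuristic), which is not shown here.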
WIKIPEDIA INFOBOXES
harvest by extraction rules:
• regex matching
• type checking
"(?i)IBL\|BEG\s*awards\s*=\s*(.*?)IBL\|END"
=> "$0 hasWonPrize @WikiLink($1)"
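The rule above pairs a regular expression over the infobox text with a fact template. A rough Python equivalent on raw wikitext is sketched below; the sample infobox, the two patterns and the `extract_awards` helper are assumptions for illustration, and the IBL|BEG / IBL|END markers of the original rule (apparently produced by a preprocessing step) are not reproduced.

```python
import re

# Hypothetical infobox snippet in raw wikitext.
INFOBOX = """{{Infobox scientist
| name   = Max Planck
| awards = [[Nobel Prize in Physics]] (1918)
}}"""

# Value of the "awards" attribute, matched case-insensitively line by line.
AWARDS = re.compile(r"(?im)^\|\s*awards\s*=\s*(.*?)\s*$")
# Target of a wiki link, ignoring an optional display label after "|".
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def extract_awards(entity, wikitext):
    """Emit (entity, hasWonPrize, prize) for every link in the awards value."""
    facts = []
    for value in AWARDS.findall(wikitext):
        for target in WIKILINK.findall(value):
            facts.append((entity, "hasWonPrize", target.strip().replace(" ", "_")))
    return facts

harvested = extract_awards("Max_Planck", INFOBOX)
```

Only linked values become fact candidates here, which mirrors the @WikiLink step; type checking of the object would follow as a separate filter.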
TYPE CHECKING
Use consistency constraints to prune false candidates
Simple type checks:
marriedTo(Planck, quantum physics)
ground atoms:
spouse(Hillary,Bill)
spouse(Carla,Nicolas)
spouse(Cecilia,Nicolas)
spouse(Carla,Ben)
spouse(Carla,Mick)
spouse(Carla,Sofie)
f(Hillary)
f(Carla)
f(Cecilia)
f(Sofie)
m(Bill)
m(Nicolas)
m(Ben)
m(Mick)
FOL rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,x) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
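The pruning idea can be illustrated on the ground atoms of this slide. The sketch below applies the type rule (one spouse in f, the other in m) and then lists which of the remaining candidates exclude each other under the two functional rules; the helper names are assumptions, and choosing among conflicting candidates would additionally need weights, which are not modelled here.

```python
FEMALE = {"Hillary", "Carla", "Cecilia", "Sofie"}   # f(.) atoms
MALE = {"Bill", "Nicolas", "Ben", "Mick"}           # m(.) atoms

CANDIDATES = [
    ("Hillary", "Bill"), ("Carla", "Nicolas"), ("Cecilia", "Nicolas"),
    ("Carla", "Ben"), ("Carla", "Mick"), ("Carla", "Sofie"),
]

def type_ok(x, y):
    """spouse(x,y) requires (f(x) and m(y)) or (m(x) and f(y))."""
    return (x in FEMALE and y in MALE) or (x in MALE and y in FEMALE)

def conflicts(a, b):
    """Two different candidates clash if they share the first or the second person."""
    return a != b and (a[0] == b[0] or a[1] == b[1])

# Step 1: drop candidates that fail the type check.
typed = [c for c in CANDIDATES if type_ok(*c)]

# Step 2: collect mutually exclusive pairs among the survivors.
clashes = [(a, b) for i, a in enumerate(typed) for b in typed[i + 1:] if conflicts(a, b)]
```

spouse(Carla, Sofie) is removed by the type check alone, while spouse(Hillary, Bill) is the only candidate that takes part in no clash.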
MACHINE READING
O. Etzioni, M. Banko, M.J. Cafarella: Machine Reading, AAAI'06
T. Mitchell et al.: Populating the Semantic Web by Macro-Reading Internet Text, ISWC'09
PROSPERA: PATTERN-BASED HARVESTING
Facts & Fact Candidates ↔ Patterns
Facts:
(Hillary, Bill)
(Carla, Nicolas)
Patterns:
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
…
Fact candidates:
(Angelina, Brad)
(Hillary, Bill)
(Victoria, David)
(Carla, Nicolas)
(Yoko, John)
(Carla, Benjamin)
(Larry, Google)
(Kate, Pete)
• good for recall
• noisy, drifting
• not robust enough for high precision
PROSPERA: REASONING EXAMPLE
• Is “attended secondary school” a goodPattern for bornIn?
• Is “attended secondary school” a goodPattern for attendedSchoolIn?
• Elvis attended secondary school in Memphis.
• Elvis isBornIn Mississippi
• A person cannot be born in two places
• Memphis not isIn Mississippi
• => “attended secondary school” is not a goodPattern for bornIn
• Hermann Einstein attended secondary school in Germany.
• Hermann Einstein attendedSchoolIn Stuttgart
• Stuttgart is located in Germany
• => “attended secondary school” is a goodPattern for attendedSchoolIn
• Weighted MaxSat problem
– Find the best assignment of patterns to relations
– The assignment should maximize the weights of correct facts
• Formalization
– Predefined rules
– Instantiate NEs
– Find best pattern-relation assignment
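The Elvis example can be phrased as a tiny weighted MaxSat instance and solved by brute force. The sketch below is only meant to show the shape of the problem: the variable names, the clause set and all weights are made up for illustration, and a Web-scale system would need a dedicated solver rather than enumeration.

```python
from itertools import product

P = "attended secondary school"
VARS = [
    "bornIn(Elvis,Mississippi)",
    "bornIn(Elvis,Memphis)",
    "attendedSchoolIn(Elvis,Memphis)",
    f"goodPattern({P},bornIn)",
    f"goodPattern({P},attendedSchoolIn)",
]
B_MISS, B_MEM, A_MEM, GOOD_B, GOOD_A = VARS

# Each clause is (weight, literals); a literal (var, value) holds if the
# variable takes that value. All weights are illustrative assumptions.
CLAUSES = [
    (10, [(B_MISS, True)]),                    # fact already in the knowledge base
    (10, [(B_MISS, False), (B_MEM, False)]),   # a person cannot be born in two places
    (5, [(GOOD_B, False), (B_MEM, True)]),     # good pattern => the fact it would express
    (5, [(GOOD_A, False), (A_MEM, True)]),
    (1, [(GOOD_B, True), (GOOD_A, True)]),     # the pattern should express some relation
]

def satisfied_weight(assignment):
    """Total weight of the clauses that hold under this truth assignment."""
    return sum(w for w, lits in CLAUSES
               if any(assignment[var] == val for var, val in lits))

def solve():
    """Enumerate all truth assignments and keep the heaviest one."""
    worlds = (dict(zip(VARS, values))
              for values in product([False, True], repeat=len(VARS)))
    return max(worlds, key=satisfied_weight)

best = solve()
```

In the best assignment the pattern is rejected for bornIn and accepted for attendedSchoolIn, matching the reasoning on the slide.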
PROSPERA: WEB-SCALE EXPERIMENTS
• on ClueWeb'09 corpus (500 million English Web pages)
• with Hadoop cluster of 10x16 cores and 10x48 GB memory

                       PROSPERA                          ReadTheWeb [CMU]
Relation               #Facts   Precision   Prec@1000    #Facts   Precision
AthletePlaysForTeam    14685    82%         100%         456      100%
TeamPlaysAgainstTeam   15170    89%         100%         1068     99%
TeamMate               9666     86%         100%         ---      ---
FacultyAt              4394     96%         100%         ---      ---

www.mpi-inf.mpg.de/yago-naga/prospera/
[N. Nakashole et al.: WSDM'11]
PRAVDA: TEMPORAL KNOWLEDGE AND EVOLUTION
• Relation Types
– Base relations: (entity_i, entity_j)
– Temporal relations: (entity_i, entity_j)@t_k
– Type signature: (TYPE_i, TYPE_j), e.g., bornIn(PERSON, LOCATION)
• Input:
– A set of relations of interest with their type signatures.
• e.g., playsForClubTemp(PERSON, CLUB), …
– A small number of labeled positive/negative seed facts for each relation.
• e.g., (David_Beckham, Real_Madrid)@2007 vs. (David_Beckham, Manchester_U)@2007
– A large corpus of textual documents.
• e.g., ClueWeb09, Wikipedia, …
• Output:
– A set of new facts for each relation.
• e.g., (Lionel_Messi, FC_Barcelona)@2008, (Michael_Ballack, Bayern_Munich)@2005, (Ronaldo, Real_Madrid)@2004 ...
Y. Wang et al.: CIKM'11
PRAVDA: FRAMEWORK
[Figure: processing pipeline. Input: corpus + relations of interest, pos./neg. seed facts. Stages: Candidate Gathering (yields fact candidates and candidate sentences) → Pattern Analysis (yields fact candidates and seed patterns) → Graph Construction (yields the graph) → Label Propagation. Output: base & temporal facts.]
PRAVDA: CANDIDATE GATHERING
• We extract and disambiguate entities from the corpora (sentence level)
• Fact candidates
– Two entities, whose types are pertinent to a relation of interest, appear in the same sentence
– For temporal relations, an associated temporal mention must also appear in the same sentence
• Candidate sentences
– A sentence that contains at least one fact candidate.
“Beckham played for Real and Galaxy.”
(Real_Madrid, LA_Galaxy)
“Beckham joined Real in 2003.”
(David_Beckham, Real_Madrid)@2003
(David_Beckham, Real_Madrid)
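Candidate gathering for a single relation signature can be sketched as below. The mention lexicon stands in for real entity disambiguation, the year regex stands in for a temporal tagger, and the signature (PERSON, CLUB) is assumed to be the only relation of interest; all three are simplifying assumptions.

```python
import re
from itertools import permutations

# Hypothetical mention lexicon: surface form -> (canonical entity, type).
LEXICON = {
    "Beckham": ("David_Beckham", "PERSON"),
    "Real": ("Real_Madrid", "CLUB"),
    "Galaxy": ("LA_Galaxy", "CLUB"),
}
SIGNATURE = ("PERSON", "CLUB")               # e.g. playsForClub(PERSON, CLUB)
YEAR = re.compile(r"\b(1[89]\d\d|20\d\d)\b")  # very rough temporal mention detector

def fact_candidates(sentence):
    """Return (entity1, entity2, year-or-None) triples found in one sentence."""
    mentions = [LEXICON[t] for t in re.findall(r"\w+", sentence) if t in LEXICON]
    years = [int(y) for y in YEAR.findall(sentence)]
    found = []
    for (e1, t1), (e2, t2) in permutations(mentions, 2):
        if (t1, t2) == SIGNATURE:
            found.append((e1, e2, None))                # base fact candidate
            found.extend((e1, e2, y) for y in years)    # temporal fact candidates
    return found

base_only = fact_candidates("Beckham played for Real and Galaxy.")
with_time = fact_candidates("Beckham joined Real in 2003.")
```

Under this signature the pair (Real_Madrid, LA_Galaxy) is not produced, because both entities are of type CLUB.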
PRAVDA: PATTERN REPRESENTATION
“Beckham finally moved from Real Madrid before his recent joining LA_Galaxy in 2007.”
• Pattern is a string between two named entities
• The string is changed and generalized for better IE
• Surface string
– “finally moved from Real Madrid before his recent joining”
• Compressed surface string
– Only keep verbs, nouns and prepositions.
– “move from Real Madrid before join”
• Compressed and lifted surface string
– Replacing entity mentions by their types
– “move from CLUB before join”
• n-grams based on compressed and lifted surface string
– {“move from CLUB”, “from CLUB before”, “CLUB before join”}
• Final representation: (TYPE1, TYPE2, p)
– The pattern for fact candidate (David_Beckham, LA_Galaxy) is (PERSON, CLUB, {“move from CLUB”, “from CLUB before”, “CLUB before join”})
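The last two steps of the pattern representation, lifting entities to types and cutting the result into n-grams, can be sketched as follows. The compression step (keeping lemmatized verbs, nouns and prepositions) needs a POS tagger and is assumed to have been done already; `ENTITY_TYPES`, `lift` and `ngrams` are illustrative names.

```python
# Hypothetical entity-to-type lookup; a real system would consult the knowledge base.
ENTITY_TYPES = {
    "Real_Madrid": "CLUB",
    "LA_Galaxy": "CLUB",
    "David_Beckham": "PERSON",
}

def lift(tokens):
    """Replace entity mentions by their types, leave other tokens unchanged."""
    return [ENTITY_TYPES.get(token, token) for token in tokens]

def ngrams(tokens, n=3):
    """All contiguous n-grams of the token list, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Compressed surface string, with the entity mention kept as one token.
compressed = ["move", "from", "Real_Madrid", "before", "join"]

# Final representation (TYPE1, TYPE2, p) for the candidate (David_Beckham, LA_Galaxy).
pattern = ("PERSON", "CLUB", set(ngrams(lift(compressed))))
```

Using 3-grams matches the example on the slide; other values of `n` are possible with the same helper.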
PRAVDA: PATTERN ANALYSIS
Assign patterns to relations. The pattern
– must be frequent in positive seeds
– must be infrequent in negative seeds:
conf(p, Ri) = Num_Pos / (Num_Pos + Num_Neg)
playsForClub =
{“sign for”:1.0, “score for”:1.0, “stay at”:0.8}
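The confidence formula is a simple ratio of counts. The sketch below computes it for every pattern seen with seed facts; the occurrence counts are invented so that the result reproduces the playsForClub values above.

```python
from collections import Counter

def pattern_confidence(pos_occurrences, neg_occurrences):
    """conf(p, R) = Num_Pos / (Num_Pos + Num_Neg) for every observed pattern p."""
    pos, neg = Counter(pos_occurrences), Counter(neg_occurrences)
    return {p: pos[p] / (pos[p] + neg[p]) for p in set(pos) | set(neg)}

# Invented counts: one list entry per sentence in which the pattern
# co-occurs with a positive (resp. negative) seed fact of playsForClub.
with_positive_seeds = ["sign for"] * 3 + ["score for"] * 2 + ["stay at"] * 4
with_negative_seeds = ["stay at"]

plays_for_club = pattern_confidence(with_positive_seeds, with_negative_seeds)
```

A pattern that only ever occurs with negative seeds gets confidence 0 under this definition.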
PRAVDA: GRAPH GENERATION
• A weighted undirected graph G(V, E, W)
– Vertices are either fact vertices (VF) or pattern vertices (VP):
• VF (facts): (David_Beckham,Real_Madrid)@2003
• VP (patterns): (PLAYER,CLUB,{“sign for”:1.0, “score for”:1.0})
– Edge set E
• Edge Type 1: Between a fact vertex vf and a pattern vertex vp
– The edge weight is calculated using the number of sentences which contain the fact of vf and the pattern of vp
• Edge Type 2: Between two pattern vertices
– The edge weight is defined as the similarity of the two patterns.
» Two patterns’ type signature
» Whether sharing the same verb and preposition
» Distance-weighted Jaccard similarity
Example sentences:
Beckham's new contract with Real starts from 2003.
Beckham finally moved to Real in 2003.
Beckham finally moved to Spain in 2003.
Rafael's last minute move to Hotspur is the best transfer in 2010.
Edge between a fact vertex and a pattern vertex:
# of sentences containing VF1 and VP1 = 25
W(VF1, VP1) = 1-e(-1)*0.03*25 = 0.7
Edge between two pattern vertices:
VP2 = {“move to”:1}
VP4 = {“minute move”:1, “move to”:2}
Jaccard = {“move to”}/{“minute move”, “move to”} = 2/(1+2) = 0.67
[Figure: example graph with edge weights 0.7, 0.9, 0.3, 0.6 and 0.67.]
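The pattern-to-pattern similarity in this example can be reproduced with a weighted Jaccard ratio. The exact definition is not given here, so the sketch assumes one reading that yields the 2/(1+2) shown above: every n-gram counts with the larger of its two weights, and the ratio is shared weight over total weight.

```python
def weighted_jaccard(p, q):
    """Weight of the shared n-grams divided by the weight of all n-grams.

    Assumption: an n-gram's weight is the larger of its weights in p and q.
    """
    weight = {g: max(p.get(g, 0), q.get(g, 0)) for g in set(p) | set(q)}
    shared = sum(weight[g] for g in set(p) & set(q))
    total = sum(weight.values())
    return shared / total if total else 0.0

vp2 = {"move to": 1}
vp4 = {"minute move": 1, "move to": 2}
similarity = weighted_jaccard(vp2, vp4)
```

With the more common min/max form of weighted Jaccard the same two patterns would score 1/3, so the choice of definition matters for the resulting edge weights.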
PRAVDA: LABEL PROPAGATION (NO CONSTRAINTS)
• Labels on vertices indicate relation and its conf.
• Each vertex vi gets a vector Yi of (m+1) labels
• Yi^r indicates the initial confidence of vertex vi holding the relation r.
– Fact vertex: if vi is a seed fact for r, then Yi^r = 1.0.
– Pattern vertex: if vi is a seed pattern of r, then Yi^r = 1.0.
– Otherwise: Yi^r = 0.
YVF1: (playsForClub: 1.0; joinsClub: 0; leavesClub: 0; none: 0)
YVP2: (playsForClub: 0; joinsClub: 1.0; leavesClub: 0; none: 0)
• Labels propagate via edges into Ŷi by minimizing the sum of seed label loss, edge loss and regularization:
\[
\sum_{l=1}^{m} \Big[ \underbrace{\sum_{i=1}^{n} s_i^l \,(Y_i^l - \hat{Y}_i^l)^2}_{\text{seed label loss}} + \underbrace{\mu_1 \sum_{i,j=1}^{n} w_{ij} \,(\hat{Y}_i^l - \hat{Y}_j^l)^2}_{\text{edge loss}} + \underbrace{\mu_2 \sum_{i=1}^{n} (\hat{Y}_i^l - r_i^l)^2}_{\text{regularization}} \Big]
\]
ŶVF1: (playsForClub: 0.8; joinsClub: 0.7; leavesClub: 0.1; none: 0.01) → playsForClub
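For a single label, an objective made of a squared seed label loss, a squared edge loss weighted by μ1 and a squared regularization term weighted by μ2 can be minimized by a simple fixed-point iteration. The sketch below assumes exactly that form and a symmetric weight matrix; setting the derivative with respect to each Ŷi to zero gives the update used in the loop. The three-vertex graph, the seed indicator `s`, the prior `r` and the values of `mu1` and `mu2` are illustrative assumptions, not settings from the talk.

```python
def propagate(Y, W, s, r, mu1=1.0, mu2=0.1, iters=200):
    """Jacobi-style minimization of
       sum_i s_i (Y_i - y_i)^2 + mu1 * sum_{i,j} w_ij (y_i - y_j)^2 + mu2 * sum_i (y_i - r_i)^2
    for one label; W must be symmetric with a zero diagonal."""
    n = len(Y)
    y_hat = [0.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            neighbours = sum(W[i][j] * y_hat[j] for j in range(n) if j != i)
            degree = sum(W[i][j] for j in range(n) if j != i)
            # Closed-form minimizer for vertex i with all other vertices fixed.
            updated.append((s[i] * Y[i] + 2 * mu1 * neighbours + mu2 * r[i])
                           / (s[i] + 2 * mu1 * degree + mu2))
        y_hat = updated
    return y_hat

# Vertices: 0 = seed fact, 1 = pattern, 2 = unlabeled fact candidate.
W = [[0.0, 0.7, 0.0],
     [0.7, 0.0, 0.9],
     [0.0, 0.9, 0.0]]
Y = [1.0, 0.0, 0.0]   # initial confidence for the label, e.g. playsForClub
s = [1.0, 0.0, 0.0]   # seed loss applies to the seed vertex only
r = [0.0, 0.0, 0.0]   # regularization pulls unlabeled vertices towards 0

scores = propagate(Y, W, s, r)
```

The confidence decays with distance from the seed: the seed fact keeps the highest score, the pattern comes next, and the unlabeled fact candidate receives a smaller but positive score.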
PRAVDA: INCORPORATING CONSTRAINTS
• Inclusion constraints (IC)
– Relation level: joinsClub(David_Beckham, Real_Madrid) → worksForClub(David_Beckham, Real_Madrid)
• Exclusion constraints (EC)
– Relation level: isSonOf(George_W._Bush, George_H._W._Bush) → NOT isDaughterOf(George_W._Bush, George_H._W._Bush)
– Entity level: bornIn(Albert_Einstein, Germany) → NOT bornIn(Albert_Einstein, United_States)
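The label propagation objective (seed label loss, edge loss, regularization) is extended by three squared penalty terms over the propagated labels Ŷ, with coefficients c, d and e and weights μ3, μ4 and μ5: one for IC, one for EC at entity level and one for EC at relation level. They are abbreviated below as P_3, P_4 and P_5:
\[
\sum_{l=1}^{m} \Big[ \sum_{i=1}^{n} s_i^l \,(Y_i^l - \hat{Y}_i^l)^2 + \mu_1 \sum_{i,j=1}^{n} w_{ij} \,(\hat{Y}_i^l - \hat{Y}_j^l)^2 + \mu_2 \sum_{i=1}^{n} (\hat{Y}_i^l - r_i^l)^2 \Big] + \mu_3 P_3(\hat{Y}) + \mu_4 P_4(\hat{Y}) + \mu_5 P_5(\hat{Y})
\]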
PRAVDA: EXPERIMENTAL STUDY
• Data Set
– 23000 soccer players and celebrities in Wikipedia articles
– 110000 online news articles about persons in the “FIFA 100 list”
– 88000 news articles about persons in the “Forbes 100 list”
• Prominent facts are chosen as seed facts according to their frequency
PRAVDA: BASE FACT EXTRACTION RESULT
100 positive seeds and 10 negative seeds
PRAVDA: TEMPORAL FACT EXTRACTION RESULT
CONCLUSION AND OUTLOOK
Entities & Classes
strong success story, some problems left:
• large taxonomies of classes with individual entities
• long tail calls for new methods
• entity disambiguation remains grand challenge
Relationships
good progress, but many challenges left:
• recall & precision by patterns & reasoning
• efficiency & scalability
• soft rules, hard constraints, richer logics, …
• open-domain discovery of new relation types
Temporal Knowledge
widely open (fertile) research ground:
• uncertain / incomplete temporal scopes of facts
• joint reasoning on ER facts and time scopes
REFERENCES
• F.M. Suchanek, G. Kasneci, G. Weikum: Yago: a core of semantic knowledge. WWW 2007
• J. Hoffart, F.M. Suchanek, K. Berberich, et al.: YAGO2: exploring and querying world knowledge in time, space, context, and many languages. WWW 2011
• F.M. Suchanek et al.: SOFIE: a self-organizing framework for information extraction. WWW 2009
• Y. Wang, M. Zhu, L. Qu, M. Spaniol, G. Weikum: Timely YAGO: harvesting, querying, and visualizing temporal knowledge from Wikipedia. EDBT 2010
• Y. Wang, L. Qu, B. Yang, M. Spaniol, G. Weikum: Harvesting Facts from Textual Web Sources by Constrained Label Propagation. CIKM 2011
• A. Mazeika, T. Tylenda, G. Weikum: Entity Timelines: Visual Analytics and Named Entity Evolution. CIKM 2011