Natural Language Questions for the Web of Data
Mohamed Yahya, Klaus Berberich, Gerhard Weikum (Max Planck Institute for Informatics, Germany)
Shady Elbassuoni (Qatar Computing Research Institute)
Maya Ramanath (Dept. of CSE, IIT-Delhi, India)
Volker Tresp (Siemens AG, Corporate Technology, Munich, Germany)
EMNLP 2012
Translation from QNL to QFL
QNL: natural language question
“Which female actor played in Casablanca and is married to a writer who was born in Rome?”
QFL: SPARQL 1.0
?x hasGender female
?x isa actor
?x actedIn Casablanca_(film)
?x marriedTo ?w
?w isa writer
?w bornIn Rome
Characteristics of SPARQL:
• Complex queries
• Good results
• Difficult for the user
Yago2
YAGO2s is a huge semantic knowledge base, derived from Wikipedia, WordNet and GeoNames.
Relation, Class, Entities
Architecture of DEANNA.
Phrase detection
A detected phrase p is a pair <Toks, l>
Toks: the phrase
l: the label, l ∈ {concept, relation}
Phrase detection: QNL → phrases
P_r: {<*, relation>}
P_c: {<*, concept>}
Phrase detection
Concept phrase detection: search instances of the means relation in Yago2
e.g. “Which female actor played in Casablanca and is married to a writer who was born in Rome?”
Phrase detection
relation phrase detection : rely on a relation detector based on ReVerb (Fader et al., 2011) with additional POS tag patterns
e.g. “Which female actor played in Casablanca and is married to a writer who was born in Rome?”
Phrase Mapping
To map concept phrases: also search instances of the means relation in Yago2
To map relation phrases: rely on a corpus of mappings from textual patterns to relations
e.g. “Which female actor played in Casablanca and is married to a writer who was born in Rome?”
Table columns: textual pattern, relation
Q-Unit Generation
Q-unit generation: mappings → candidate graph
A q-unit is a triple of sets of phrases
Dependency parsing identifies triples of tokens <t_rel, t_arg1, t_arg2>, where t_rel, t_arg1, t_arg2 ∈ q_NL
e.g. “who was born in Rome?”
nsubjpass(born-3, who-1)
auxpass(born-3, was-2)
root(ROOT-0, born-3)
prep_in(born-3, Rome-5)
Dependency tree: born (root), with children who (nsubjpass) and Rome (in)
t_rel = born, t_arg1 = who, t_arg2 = Rome
Triple of tokens: <born, who, Rome>
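The step from the typed dependencies to the token triple can be sketched in a few lines. This is a minimal illustration, not the DEANNA implementation: the function names `parse_dependencies` and `token_triple`, and the heuristic "the root is t_rel, its subject is t_arg1, its prepositional or direct object is t_arg2", are assumptions made for this example.

```python
import re

# Matches Stanford-style typed dependencies such as "prep_in(born-3, Rome-5)".
DEP_PATTERN = re.compile(r"(\w+)\((\S+?)-(\d+), (\S+?)-(\d+)\)")


def parse_dependencies(lines):
    """Return a list of (relation, head_token, dependent_token) tuples."""
    deps = []
    for line in lines:
        match = DEP_PATTERN.fullmatch(line.strip())
        if match:
            rel, head, _, dep, _ = match.groups()
            deps.append((rel, head, dep))
    return deps


def token_triple(deps):
    """Build <t_rel, t_arg1, t_arg2> from the root, its subject, and its object.

    Assumed heuristic, for illustration only.
    """
    root = next(dep for rel, head, dep in deps if rel == "root")
    subject = next(dep for rel, head, dep in deps
                   if head == root and rel.startswith("nsubj"))
    obj = next(dep for rel, head, dep in deps
               if head == root and (rel.startswith("prep_") or rel == "dobj"))
    return (root, subject, obj)


parsed = parse_dependencies([
    "nsubjpass(born-3, who-1)",
    "auxpass(born-3, was-2)",
    "root(ROOT-0, born-3)",
    "prep_in(born-3, Rome-5)",
])
triple = token_triple(parsed)
print(triple)
```

For the four dependencies of "who was born in Rome?" the sketch yields the triple of tokens shown on the slide.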
Q-Unit Generation
A q-unit is a triple of sets of phrases <{p_rel ∈ P_r}, {p_arg1 ∈ P_c}, {p_arg2 ∈ P_c}>, where t_rel ∈ p_rel, t_arg1 ∈ p_arg1, and t_arg2 ∈ p_arg2.
Triple of tokens: <born, writer, Rome>
Phrases: <born, relation>, <was born, relation>, <Rome, concept>, <a writer, concept>
Joint Disambiguation
Rules:
1. Each phrase is assigned to at most one semantic item.
2. Phrase boundary ambiguity is resolved (only non-overlapping phrases are mapped).
Joint Disambiguation
Disambiguation Graph
• Joint disambiguation takes place over a disambiguation graph DG = (V, E)
  – V = Vs ∪ Vp ∪ Vq
  – E = Esim ∪ Ecoh ∪ Eq
V = Vs ∪ Vp ∪ Vq
Vq: a set of placeholder nodes for q-units
Joint Disambiguation
Vs: the set of s-nodes (an s-node is a semantic item)
Vp: the set of p-nodes (a p-node is a phrase)
  Vrp: the set of relation phrases
  Vrc: the set of concept phrases
Disambiguation Graph
E = Esim ∪ Ecoh ∪ Eq
Eq ⊆ Vq × Vp × d, d ∈ {rel, arg1, arg2}
  called q-edges
Esim ⊆ Vp × Vs
  a set of weighted similarity edges
Ecoh ⊆ Vs × Vs
  a set of weighted coherence edges
Edge Weights
• Cohsem (semantic coherence)
  – defined between two semantic items s1 and s2 as the Jaccard coefficient of their sets of inlinks
• Three kinds of inlinks
  – InLinks(e)
  – InLinks(c)
  – InLinks(r)
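The Jaccard coefficient used for semantic coherence is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the function name `jaccard` and the toy inlink sets are assumptions for illustration, not data from Yago2.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy inlink sets; the members are invented for illustration.
inlinks_s1 = {"page_a", "page_b", "page_c"}
inlinks_s2 = {"page_b", "page_c", "page_d"}

# Two shared inlinks out of four distinct ones.
coh_sem = jaccard(inlinks_s1, inlinks_s2)
print(coh_sem)
```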
InLinks(e)
• InLinks(e): the set of Yago2 entities whose corresponding Wikipedia pages link to the entity e.
• e.g.
  – Let e = Casablanca
  – InLinks(Casablanca) = {Marwan_al-Shehhi, Ingrid_Bergman, …, Morocco, …}
InLinks(c)
• InLinks(c) = ∪_{e ∈ c} InLinks(e)
• e.g. let c = wikicategory_Metropolitan_areas_of_Morocco
  – InLinks(wikicategory_Metropolitan_areas_of_Morocco) = InLinks(Casablanca) ∪ InLinks(Marrakech) ∪ InLinks(Fes) ∪ InLinks(Agadir) ∪ InLinks(Safi,_Morocco) ∪ InLinks(Oujda) ∪ InLinks(Tangier) ∪ InLinks(Rabat)
InLinks(r)
• InLinks(r) = ∪_{(e1, e2) ∈ r} (InLinks(e1) ∩ InLinks(e2))
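The class and relation definitions both reduce to set operations over entity inlinks. The sketch below assumes entity inlinks are available as a dictionary from entity name to a set of linking entities; the function names and the toy data are hypothetical.

```python
def inlinks_class(members, inlinks_e):
    """InLinks(c): union of InLinks(e) over all entities e in class c."""
    result = set()
    for e in members:
        result |= inlinks_e.get(e, set())
    return result


def inlinks_relation(pairs, inlinks_e):
    """InLinks(r): union over (e1, e2) in r of InLinks(e1) ∩ InLinks(e2)."""
    result = set()
    for e1, e2 in pairs:
        result |= inlinks_e.get(e1, set()) & inlinks_e.get(e2, set())
    return result


# Toy data, invented for illustration (not real Yago2 inlinks).
inlinks_e = {
    "A": {"p1", "p2"},
    "B": {"p2", "p3"},
    "C": {"p3", "p4"},
}
class_links = inlinks_class(["A", "B"], inlinks_e)
relation_links = inlinks_relation([("A", "B"), ("B", "C")], inlinks_e)
print(sorted(class_links), sorted(relation_links))
```

Note the asymmetry: a class collects every page linking to any member, while a relation only collects pages that link to both arguments of at least one of its facts.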
Similarity Weights
• For entities
  – how often a phrase refers to a certain entity in Wikipedia
• For classes
  – reflects the number of members in a class
• For relations
  – reflects the maximum n-gram similarity between the phrase and any of the relation’s surface forms
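The slides do not say which n-gram similarity is used for relations. As one possible reading, the sketch below assumes the Jaccard overlap of character trigrams and takes the maximum over a relation's surface forms; the measure, the function names, and the example surface forms are all assumptions.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-gram sets (an assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def relation_similarity(phrase, surface_forms, n=3):
    """Maximum n-gram similarity between the phrase and any surface form."""
    return max((ngram_similarity(phrase, form, n) for form in surface_forms),
               default=0.0)


# Hypothetical surface forms for a relation such as bornIn.
score = relation_similarity("was born in", ["born in", "was born in", "birthplace"])
print(score)
```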
Disambiguation Graph Processing
• The result of disambiguation is a subgraph of the disambiguation graph, yielding the most coherent mappings.
• We employ an integer linear program (ILP) to this end.
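The ILP itself (definitions, objective function, constraints) is not reproduced in this transcript. To give a feel for what "most coherent mapping" means, the sketch below replaces the ILP with an exhaustive search over a toy graph: each phrase gets at most one semantic item, and a candidate mapping is scored by weighted similarity plus weighted pairwise coherence. This is a simplified stand-in, not the authors' formulation; the weights `alpha` and `beta`, all scores, and the item name `Casablanca_(city)` are invented.

```python
from itertools import combinations, product


def best_mapping(candidates, sim, coh, alpha=1.0, beta=1.0):
    """Exhaustively pick at most one semantic item per phrase.

    candidates: phrase -> list of candidate semantic items
    sim: (phrase, item) -> similarity weight
    coh: frozenset({item1, item2}) -> coherence weight
    """
    phrases = list(candidates)
    # None means "leave this phrase unmapped".
    options = [[None] + list(candidates[p]) for p in phrases]
    best_score, best = float("-inf"), None
    for choice in product(*options):
        chosen = [(p, s) for p, s in zip(phrases, choice) if s is not None]
        score = alpha * sum(sim[(p, s)] for p, s in chosen)
        items = [s for _, s in chosen]
        score += beta * sum(coh.get(frozenset(pair), 0.0)
                            for pair in combinations(items, 2))
        if score > best_score:
            best_score, best = score, dict(chosen)
    return best, best_score


# Toy weights, invented for illustration.
candidates = {
    "Casablanca": ["Casablanca_(film)", "Casablanca_(city)"],
    "played in": ["actedIn", "playedFor"],
}
sim = {
    ("Casablanca", "Casablanca_(film)"): 0.4,
    ("Casablanca", "Casablanca_(city)"): 0.6,
    ("played in", "actedIn"): 0.5,
    ("played in", "playedFor"): 0.5,
}
coh = {
    frozenset({"Casablanca_(film)", "actedIn"}): 0.9,
    frozenset({"Casablanca_(city)", "playedFor"}): 0.1,
}
mapping, score = best_mapping(candidates, sim, coh)
print(mapping, score)
```

In this toy setup the city reading has the higher similarity on its own, but the film reading wins once coherence with actedIn is counted, which is the effect joint disambiguation is meant to capture. Exhaustive search grows exponentially with the number of phrases, so it is only usable on very small examples.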
Definitions (part1)
Definitions (part2)
objective function
Constraints(1~3)
Constraints(4~7)
Constraints(8~9)
This is not invoked for existential questions
resulting subgraph for the disambiguation graph of Figure 3
Query Generation
• Subject/object roles are not assigned in triploids and q-units
• Example:
  – “Which singer is married to a singer?”
  – ?x type singer, ?x marriedTo ?y, and ?y type singer
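Once the triple patterns are fixed, assembling the query text is mechanical. The following sketch renders the three patterns of the singer example as a SELECT query; the function name `to_sparql` is hypothetical, and terms are emitted as bare names without prefixes or IRIs, so the output is schematic rather than directly executable against an endpoint.

```python
def to_sparql(triple_patterns):
    """Render (subject, predicate, object) patterns as a SELECT query.

    Terms starting with '?' are variables; all others are emitted as-is,
    without prefixes or IRIs (a simplification for illustration).
    """
    variables = []
    for pattern in triple_patterns:
        for term in pattern:
            if term.startswith("?") and term not in variables:
                variables.append(term)
    body = " .\n  ".join(" ".join(pattern) for pattern in triple_patterns)
    return f"SELECT {' '.join(variables)} WHERE {{\n  {body}\n}}"


query = to_sparql([
    ("?x", "type", "singer"),
    ("?x", "marriedTo", "?y"),
    ("?y", "type", "singer"),
])
print(query)
```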
5 Evaluation
• Datasets
• Evaluation Metrics
• Results & Discussion
Datasets
• The authors' experiments are based on two collections of questions:
  – QALD-1
    • 1st Workshop on Question Answering over Linked Data (QALD-1)
  – NAGA collection
    • questions used in the context of the NAGA project
    • The NAGA collection is based on linking data from the Yago2 knowledge base
• Training set
  – 23 QALD-1 questions
  – 43 NAGA questions
• Test set
  – 27 QALD-1 questions
  – 44 NAGA questions
• Obtain the hyperparameters (α, β, γ) in the ILP objective function
• 19 QALD-1 questions in the test set
Evaluation Metrics
• The authors evaluated the output of DEANNA at three stages
  – 1. after the disambiguation of phrases
  – 2. after the generation of the SPARQL query
  – 3. after obtaining answers from the underlying linked-data sources
• Judgement
  – two human assessors judged whether an output item was good or not
  – if the two disagreed, a third person resolved the judgment
disambiguation stage
• The task of the judges
  – looked at each q-node/s-node pair, in the context of the question and the underlying data schemas
  – determined whether the mapping was correct or not
  – determined whether any expected mappings were missing
query-generation stage
• The task of the judges
  – looked at each triple pattern
  – determined whether the pattern was meaningful for the question or not
  – determined whether any expected triple pattern was missing
query-answering stage
• the judges were asked to identify if the result sets for the generated queries are satisfactory.
• Micro-averaging
  – aggregates over all assessed items regardless of the questions to which they belong
• Macro-averaging
  – first aggregates the items for the same question, and then averages the quality measure over all questions
• For a question q and item set s in one of the stages of evaluation
  – correct(q, s): the number of correct items in s
  – ideal(q): the size of the ideal item set
  – retrieved(q, s): the number of retrieved items
• Coverage and precision are defined as follows:
  cov(q, s) = correct(q, s) / ideal(q)
  prec(q, s) = correct(q, s) / retrieved(q, s)
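The difference between the two averaging schemes is easiest to see on numbers. The sketch below implements cov and prec as defined above and contrasts macro- and micro-averaged precision; the per-question counts are invented, the function names are hypothetical, and the code assumes non-zero denominators.

```python
def coverage(correct, ideal):
    """cov(q, s) = correct(q, s) / ideal(q); assumes ideal > 0."""
    return correct / ideal


def precision(correct, retrieved):
    """prec(q, s) = correct(q, s) / retrieved(q, s); assumes retrieved > 0."""
    return correct / retrieved


def macro_average(per_question, measure):
    """Compute the measure per question, then average over questions."""
    values = [measure(q) for q in per_question]
    return sum(values) / len(values)


def micro_precision(per_question):
    """Pool all items across questions before dividing."""
    total_correct = sum(q["correct"] for q in per_question)
    total_retrieved = sum(q["retrieved"] for q in per_question)
    return total_correct / total_retrieved


# Invented counts for two questions, for illustration only.
questions = [
    {"correct": 1, "retrieved": 2, "ideal": 4},
    {"correct": 8, "retrieved": 8, "ideal": 8},
]
macro_prec = macro_average(questions, lambda q: precision(q["correct"], q["retrieved"]))
macro_cov = macro_average(questions, lambda q: coverage(q["correct"], q["ideal"]))
micro_prec = micro_precision(questions)
print(macro_prec, macro_cov, micro_prec)
```

With these counts the macro-averaged precision is 0.75 while the micro-averaged precision is 0.9: micro-averaging lets the question with many items dominate, whereas macro-averaging weights every question equally.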
Conclusions
• The authors presented a method for translating natural language questions into structured queries.
• Although the authors' model, in principle, leads to high combinatorial complexity, they observed that the Gurobi solver could handle their judiciously designed ILP very efficiently.
• The authors' experimental studies showed very high precision and good coverage of the query translation, and good results in the actual question answers.