KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
1 Institute of Applied Informatics and Formal Description Methods (AIFB), Karlsruhe, Germany
www.kit.edu
A language-independent method for the extraction of RDF verbalization templates
Basil Ell,1 Andreas Harth1
8th International Natural Language Generation Conference, 20 June 2014, Philadelphia, PA, USA
Institute of Applied Informatics and Formal Description Methods (AIFB)
Motivation
More and more data openly available as RDF
Basil Ell - A language-independent method for the extraction of RDF verbalization templates
Linked Open Data initiative
Search Engine
keywords, questions, etc.
Text
NLG
Encyclopedia or Google Knowledge Graph
Textual description of a thing
NLG
Example RDF data - Triples
Subject Predicate Object
dbr:Curtain_(Novel) dbo:author dbr:Agatha_Christie
dbr:Curtain_(Novel) rdf:type dbo:Book
dbr:Curtain_(Novel) rdfs:label "Curtain (novel)"@en
dbr:Curtain_(Novel) dbp:releaseDate "September 1975"@en
dbr:Agatha_Christie rdf:type dbo:Writer
dbr:Agatha_Christie rdfs:label "Agatha Christie"@en
dbo:Book rdfs:label "book"@en
Example RDF data - Graph
Overview
Motivation
RDF Verbalization Templates
Automatic Template Extraction
Evaluation
Related Work
Summary
RDF VERBALIZATION TEMPLATES
RDF Verbalization Template (1/2)
Graph pattern (GP)
Sentence pattern (SP)
RDF Verbalization Template (2/2)
GP represented as SPARQL query:
SELECT ?book_label ?book_type_label ?author_label ?book_rD
WHERE {
  ?book dbo:author ?author .
  ?book dbp:releaseDate ?book_rD .
  ?book rdf:type ?book_type .
  ?book_type rdfs:label ?book_type_label .
  ?book rdfs:label ?book_label .
  ?author rdfs:label ?author_label .
  ?author rdf:type dbo:Writer .
}
Query results:
book_label = "Curtain (novel)"
book_type_label = "book"
author_label = "Agatha Christie"
book_rD = "September 1975"
Verbalization result:
Curtain is a book by Agatha Christie published in September 1975.
RDF data:
Subject Predicate Object
dbr:Curtain_(Novel) dbo:author dbr:Agatha_Christie
dbr:Curtain_(Novel) rdf:type dbo:Book
dbr:Curtain_(Novel) rdfs:label "Curtain (novel)"@en
dbr:Curtain_(Novel) dbp:releaseDate "September 1975"@en
dbr:Agatha_Christie rdf:type dbo:Writer
dbr:Agatha_Christie rdfs:label "Agatha Christie"@en
dbo:Book rdfs:label "book"@en
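The step from query results to the verbalization result can be sketched in Python. This is only an illustration: the sentence pattern string, the `verbalize` function, and the `strip_parenthetical` modifier are assumptions made for this sketch, since the slide's sentence pattern figure is not preserved in this transcript.

```python
import re


def strip_parenthetical(label: str) -> str:
    """Hypothetical modifier: drop a trailing parenthetical such as ' (novel)'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", label)


def verbalize(sentence_pattern: str, bindings: dict, modifiers: dict = None) -> str:
    """Fill the variables of a sentence pattern with (optionally modified) query results."""
    modifiers = modifiers or {}
    values = {
        var: modifiers.get(var, lambda s: s)(value)
        for var, value in bindings.items()
    }
    return sentence_pattern.format(**values)


# Query results as listed on the slide.
bindings = {
    "book_label": "Curtain (novel)",
    "book_type_label": "book",
    "author_label": "Agatha Christie",
    "book_rD": "September 1975",
}

# Assumed sentence pattern (SP); variables correspond to the SELECT variables of the GP.
sentence_pattern = (
    "{book_label} is a {book_type_label} by {author_label} published in {book_rD}."
)

result = verbalize(sentence_pattern, bindings, {"book_label": strip_parenthetical})
print(result)
```

Keeping the modifier separate from the pattern is a design choice of this sketch: the same sentence pattern can then be reused for labels with or without a disambiguating suffix.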
AUTOMATIC TEMPLATE EXTRACTION
Template Extraction (1/6) - Overview
Parallel text-data corpus → RDF verbalization templates
1. Sentence Collection
2. Text-Data Alignment
3. Abstraction
4. Grouping
5. Pattern Mining
6. Template Creation
Experiment:
Text from Wikipedia
Data from DBpedia
10 virtual machines: 8 vCPUs, 8 GB RAM, 40 GB disk
Extraction ran for 2 weeks
Template Extraction (2/6) - Features
Distant-supervised: no hand-labeled training data required
Simultaneous multi-relation learning: simultaneously learning all relations in a sentence
Frequent maximal subgraph pattern mining: identify commonalities among RDF graph patterns
Language independent: does not rely on syntactic parsing
Example Template (1/2, 2/2) [figures omitted]
Template Extraction (3/6) - Alignment
[Figure: step-by-step alignment of a sentence with RDF data. Legend: entity, literal, identified entity (i), identified literal (i), label, modifier (m1 to m5), matched string]
Language-independent approach: no syntactic parsing
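A minimal sketch of parse-free alignment, assuming plain substring matching of entity labels against the sentence. The `align` function, the modifier list, and the matching strategy are hypothetical illustrations, not the authors' implementation; the labels are taken from the example RDF data.

```python
import re


def strip_parenthetical(label: str) -> str:
    """Hypothetical modifier: drop a trailing parenthetical such as ' (novel)'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", label)


def align(sentence: str, labels: dict, modifiers: list) -> list:
    """Return sorted (start, end, resource, matched string) tuples.

    For every resource, each modifier is applied to its label in turn and the
    first variant that occurs in the sentence is recorded. No tokenization or
    syntactic parsing is involved, only string search.
    """
    matches = []
    for resource, label in labels.items():
        for modify in modifiers:
            candidate = modify(label)
            start = sentence.find(candidate)
            if start != -1:
                matches.append((start, start + len(candidate), resource, candidate))
                break
    return sorted(matches)


labels = {
    "dbr:Curtain_(Novel)": "Curtain (novel)",
    "dbr:Agatha_Christie": "Agatha Christie",
    "dbo:Book": "book",
}
modifiers = [lambda s: s, strip_parenthetical]  # identity first, then the stripped form
sentence = "Curtain is a book by Agatha Christie published in September 1975."

matches = align(sentence, labels, modifiers)
for start, end, resource, matched in matches:
    print(start, end, resource, repr(matched))
```

Naive substring search can also match inside longer words, so a practical version would need some boundary check; that refinement is left out here to keep the sketch short.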
Template Extraction (4/6) – Abstraction
Abstraction 1: hypothesis graph pattern 1, pattern 1
Abstraction 2: hypothesis graph pattern 2, pattern 2
Template Extraction (5/6) - Grouping
Group hypothesis graph patterns with equivalent sentence patterns:
'"{V1}" is a short story by {V2}.': abstraction-64451-1, abstraction-88393-1, abstraction-4732-1, abstraction-50480-1
'"{V1}" is a single by American {V9} {V4} {V8}.': abstraction-22205-1, abstraction-22205-3, abstraction-72533-1, abstraction-127891-2
'{V1} (born {V2}) is a German footballer.': abstraction-86372-1, abstraction-86415-1, abstraction-135340-5, abstraction-140464-2
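Grouping by sentence pattern amounts to bucketing abstractions under their sentence pattern string. The sketch below assumes that "equivalent" means string-identical and that each abstraction is a (sentence pattern, identifier) pair; both are simplifying assumptions, and `group_by_sentence_pattern` is a hypothetical name. The identifiers are a subset of those shown on the slide.

```python
from collections import defaultdict


def group_by_sentence_pattern(abstractions):
    """Map each sentence pattern to the abstraction identifiers that share it."""
    groups = defaultdict(list)
    for sentence_pattern, abstraction_id in abstractions:
        groups[sentence_pattern].append(abstraction_id)
    return dict(groups)


abstractions = [
    ('"{V1}" is a short story by {V2}.', "abstraction-64451-1"),
    ("{V1} (born {V2}) is a German footballer.", "abstraction-86372-1"),
    ('"{V1}" is a short story by {V2}.', "abstraction-88393-1"),
    ("{V1} (born {V2}) is a German footballer.", "abstraction-86415-1"),
]

groups = group_by_sentence_pattern(abstractions)
for pattern, members in groups.items():
    print(pattern, members)
```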
Template Extraction (6/6) - fmSpan
fmSpan - Frequent maximal subgraph pattern mining
Input:
Set of graph patterns
Minimal coverage value: c
Output: Set of graph patterns
Each graph pattern
is a subgraph of at least c graph patterns (→ frequent)
cannot be extended while maintaining coverage (→ maximal)
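The two output conditions can be illustrated with a brute-force sketch. This is not the fmSpan algorithm itself: it assumes graph patterns are sets of triple patterns whose variables are already named consistently (so subgraph testing reduces to a subset test), it ignores connectedness, it reads "maximal" as "no proper superset is also frequent", and it enumerates all subsets, which is only feasible for tiny inputs. The function name and the example patterns are hypothetical.

```python
from itertools import combinations


def frequent_maximal_subpatterns(patterns, c):
    """Return all triple-pattern sets that are frequent (contained in at least
    c input patterns) and maximal (no proper superset is also frequent)."""
    patterns = [frozenset(p) for p in patterns]

    # Candidates: every non-empty subset of every input pattern.
    candidates = set()
    for pattern in patterns:
        for size in range(1, len(pattern) + 1):
            for combo in combinations(sorted(pattern), size):
                candidates.add(frozenset(combo))

    def coverage(candidate):
        return sum(1 for pattern in patterns if candidate <= pattern)

    frequent = {cand for cand in candidates if coverage(cand) >= c}
    return {f for f in frequent if not any(f < g for g in frequent)}


author = ("?b", "dbo:author", "?a")
is_book = ("?b", "rdf:type", "dbo:Book")
is_writer = ("?a", "rdf:type", "dbo:Writer")
released = ("?b", "dbp:releaseDate", "?d")

hypothesis_patterns = [
    {author, is_book, is_writer},
    {author, is_book},
    {author, is_book, released},
]

result = frequent_maximal_subpatterns(hypothesis_patterns, c=3)
print(result)
```

With c = 3, only the two triple patterns shared by all three inputs survive, and they are returned together as one maximal pattern rather than as two single-triple patterns.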
EVALUATION
Evaluation (1/4) - Experiment
Parallel text-data corpus:
88,708,622 triples
4,004,478 English documents, 716,049 German documents
3,811,992 English sentences, 794,040 German sentences
3,434,108 abstracted English sentences, 530,766 abstracted German sentences (with at least two identified entities)
     #groups≥5   #templates   #all groups
en   4569        3816         686,687
de   2130        1250         269,551
Evaluation (2/4) - Coverage
How often can a template be applied?
[Figure: histogram of the number of templates (#en, #de) per coverage range, with ranges from 1 to 10 up to 10,000,000 to 100,000,000 subgraphs]
About 300 templates can each be used to verbalize between 10,000 and 100,000 subgraphs.
Evaluation (3/4)
User study: 10 English templates, 10 German templates, 6 evaluators, 200 verbalizations
Accuracy (1): Is everything that is expressed in the graph pattern also expressed in the sentence pattern?
Measured for each triple pattern within the GP:
(1) The triple pattern is explicitly expressed
(2) The triple pattern is implied
(3) The triple pattern is not expressed
(4) Unsure
[Figure: bar chart of ratings (1) to (4) for en and de]
Accuracy (2): Is everything that is expressed in the sentence pattern also expressed in the graph pattern?
(1) Everything is expressed
(2) Most things are expressed
(3) Some things are expressed
(4) Nothing is expressed
[Figure: bar chart of ratings (1) to (4) for en and de]
Evaluation (4/4)
Syntactical Correctness: How syntactically correct are verbalizations?
(1) Completely syntactically correct
(2) Almost syntactically correct
(3) Some syntactical errors
(4) Strongly syntactically incorrect
[Figure: bar chart of ratings (1) to (4) for en and de]
Understandability: How understandable are verbalizations?
(1) The meaning is clear
(2) The meaning is clear, but there are some problems in word usage and/or style
(3) The basic thrust is clear, but the evaluator is not sure of some detailed parts because of word usage problems
(4) Contains many word usage problems, and the evaluator can only guess at the meaning
(5) Cannot be understood at all
[Figure: bar chart of ratings (1) to (5) for en and de]
RELATED WORK
Related Work (1/4)
(Welty et al., 2010)
Focus on IE
Input sentences are parsed
Regard relations between proper nouns only
Does not consider a graph of relations
Related Work (2/4)
(Duma and Klein, 2013)
Focus on NLG
Related Work (3/4)
(Gerber and Ngomo, 2011)
Focus on IE
< ’s acquisition of > pattern for property subsidiary
“Google’s acquisition of Youtube comes as online video is really starting to hit its stride.”
relation expressed by string between entities
Related Work (4/4)
Distant supervision: (Craven and Kumlien, 1999), (Bunescu and Mooney, 2007), (Carlson et al., 2009), (Mintz et al., 2009), (Welty et al., 2010), (Hoffmann et al., 2011), (Surdeanu et al., 2012)
Simultaneous multi-relation learning: (Carlson et al., 2009)
SUMMARY
Summary
Introduced RDF verbalization templates
Introduced template extraction approach:
Distant-supervised
Language independent
Simultaneous multi-relation learning
Frequent maximal subgraph pattern mining
Evaluation:
Large parallel text-data corpus for en and de
Good syntactical correctness & understandability
Accuracy needs to be improved in future work
Thank you for your attention!
The authors acknowledge the support of the European Commission's Seventh Framework Programme FP7-ICT-2011-7 (XLike, Grant 288342).
http://km.aifb.kit.edu/sites/bridge-patterns/INLG2014/
References (1/2)
Razvan Bunescu and Raymond Mooney. 2007. Learning to extract relations from the web using minimal supervision. In Annual meeting-association for Computational Linguistics, volume 45, pages 576–583.
Andrew Carlson, Justin Betteridge, Estevam R Hruschka Jr, and Tom M Mitchell. 2009. Coupling semi-supervised learning of categories and relations. In Proceedings of the NAACL HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pages 1–9. Association for Computational Linguistics.
Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Thomas Lengauer, Reinhard Schneider, Peer Bork, Douglas L. Brutlag, Janice I. Glasgow, Hans-Werner Mewes, and Ralf Zimmer, editors, ISMB, pages 77–86. AAAI.
Daniel Duma and Ewan Klein. 2013. Generating Natural Language from Linked Data: Unsupervised template extraction, pages 83–94. Association for Computational Linguistics, Potsdam, Germany.
Daniel Gerber and A-C Ngonga Ngomo. 2011. Bootstrapping the linked data web. In 1st Workshop on Web Scale Knowledge Extraction @ International Semantic Web Conference, volume 2011.
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 541–550. Association for Computational Linguistics.
References (2/2)
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP 09, pages 1003–1011.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Association for Computational Linguistics.
Chris Welty, James Fan, David Gondek, and Andrew Schlaikjer. 2010. Large scale relation detection. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 24–33. Association for Computational Linguistics.