
Towards Generating Text from Discourse Representation Structures

Valerio Basile, Humanities Computing, University of Groningen, [email protected]

Johan Bos, Humanities Computing, University of Groningen, [email protected]

Abstract

We argue that Discourse Representation Structures form a suitable level of language-neutral meaning representation for micro planning and surface realisation. DRSs can be viewed as the output of macro planning, and form the rough plan and structure for generating a text. We present the first ideas of building a large DRS corpus that enables the development of broad-coverage, robust text generators. A DRS-based generator imposes various challenges on micro planning and surface realisation, including generating referring expressions, lexicalisation and aggregation.

1 Introduction

Natural Language Generation (NLG) is often viewed as a complex process comprising four main tasks (Bateman and Zock, 2003): i) macro planning, building an overall text plan; ii) micro planning, selecting referring expressions and appropriate content words; iii) surface realisation, selecting grammatical constructions and linear order; and iv) physical presentation, producing final articulation and layout operations. Arguably, the output of the macro planning component in an NLG system is some sort of abstract, language-neutral representation that encodes the information and messages that need to be conveyed, structured by rhetorical relations, and supported by information that is presupposed to be common ground.

We argue that the Discourse Representation Structures (DRSs) from Discourse Representation Theory (Kamp, 1984) form an appropriate representation for this task. This choice is driven by both theoretical and practical considerations:

• DRT, being a theory for analysing meaning, is in principle language-neutral;

• Many linguistic phenomena are studied in the framework provided by DRT;

• DRT has a model-theoretical backbone, allowing applications to perform logical inferences with the aid of theorem provers.

As a matter of fact, DRT has means to encode presupposed information in a principled way (Van der Sandt, 1992), and connections with rhetorical relations are spelled out in detail (Asher, 1993). Moreover, the formal integration of DRS with named entities, thematic roles and word senses is natural.

These are, mostly, purely theoretical considerations. But in order to make DRSs a practical platform for developing NLG systems, a large corpus of text annotated with DRSs is required. Doing this manually would be far too costly. But given the developments in (mostly statistical) parsing of the last two decades, we are now in a position to use state-of-the-art tools to semi-automatically produce gold (or nearly gold) standard DRS-annotated corpora.

Such a resource could form a good basis to develop (statistical) NLG systems, and this thought is supported by current trends in broad-coverage NLG components (Elhadad and Robin, 1996; White et al., 2007) that take deep semantic representations as starting points for surface realisation. The importance of a multi-level resource for generation is underlined by Bohnet et al. (2010), who argue that the lack of such a resource is hampering progress in the field.

In this paper we show how we are building such a corpus (SemBank, Section 2), what the exact nature of the DRSs in this corpus is, and what phenomena are covered (Section 3). We also illustrate what challenges it poses for micro planning and surface realisation (Section 4). Finally, in Section 5, we discuss how generating from DRSs relates to the traditional NLG pipeline.

2 The Groningen SemBank

Various semantically annotated corpora of reasonable size exist nowadays: PropBank (Palmer et al., 2005), FrameNet (Baker et al., 1998), the Penn Discourse TreeBank (Prasad et al., 2005), and several resources developed for shared tasks such as CoNLL and SemEval. Annotated corpora that combine various levels of annotation into one formalism hardly exist. A notable exception is OntoNotes (Hovy et al., 2006), which combines syntax (Penn Treebank style), predicate-argument structure (based on PropBank), word senses, and coreference. Yet all of these resources lack a level comprising a formally grounded "deep" semantic representation that combines the various layers of linguistic annotation.

Figure 1: Screenshot of SemBank's visualisation tool for the syntax-semantics interface combining CCG and DRT.

Filling this gap is exactly the purpose of SemBank. It provides a collection of semantically annotated texts with deep rather than shallow semantics. Its goal is to integrate phenomena into one formalism, instead of covering single phenomena, and to represent texts, not sentences. SemBank is driven by linguistic theory, using Combinatory Categorial Grammar (CCG; Steedman, 2001) to provide syntactic structure, employing (Segmented) Discourse Representation Theory (Kamp, 1984; Asher and Lascarides, 2003) as semantic framework, and first-order logic as a language for automated inference tasks.

In our view, a corpus developed primarily for research purposes must be widely available to researchers in the field. Therefore, SemBank will only consist of texts whose distribution is not subject to copyright restrictions. Currently, we focus on English newswire text from an American newspaper whose articles are in the public domain. In the future we aim to cover other text genres, possibly integrating resources from the Open American National Corpus (Ide et al., 2010). The plan is to release a stable version of SemBank at regular intervals, and to provide open access to the development version.

The linguistic levels of SemBank are, in order of analysis depth: part-of-speech tags (Penn tagset); named entities (roughly based on the ACE ontology); word senses (WordNet); thematic roles (VerbNet); syntactic structure (CCG); semantic representations, including events and tense (DRT); and rhetorical relations (SDRT). Even though we talk about different levels here, they are all connected to each other. We will show how in the following section.

Size and quality are factors that influence the usefulness of annotated resources. As one of the things we have in mind is the use of statistical techniques in NLG, the corpus should be sufficiently large. However, annotating a reasonably large corpus with gold-standard semantic representations is obviously a hard and time-consuming task. We aim to provide a trade-off between quality and quantity, with a process that improves the annotation accuracy with each periodic stable release of SemBank.

This brings us to the method we employ to construct SemBank. We are using state-of-the-art tools for syntactic and semantic processing to provide a rough, first proposal of a semantic representation for a text. The most important of these tools are the C&C parser (Clark and Curran, 2004) for syntactic analysis, and Boxer (Bos, 2008) for semantic analysis. This software, trained and developed on the Penn Treebank, shows high coverage for texts in the newswire domain (up to 98%), is robust and fast, and is therefore suitable for this task.

The output of these tools is corrected by crowdsourcing methods, comprising (i) a group of experts who can propose corrections at various levels of annotation in a wiki-based fashion; and (ii) a group of non-experts who provide information for the lower-level annotation decisions by way of a Game with a Purpose, similar to the successful Phrase Detectives (Chamberlain et al., 2008) and Jeux de Mots (Artignan et al., 2009).

Table 1: Illustration of linguistic information integration in SemBank

Level                 Theory/Source                       Internal DRS Encoding
semantics             DRT (Kamp and Reyle, 1993)          drs(...,...)
named entities        ACE                                 named(X,'Clinton',per)
thematic roles        VerbNet (Kipper et al., 2008)       rel(E,X,'Agent')
word senses           WordNet (Fellbaum, 1998)            pred(X,loon,n,2)
rhetorical relations  SDRT (Asher and Lascarides, 2003)   rel(K1,K2,elaboration)

3 Discourse Representation Structures

A DRS comprises two parts: a set of discourse referents (the entities introduced in the text), and a set of conditions describing the properties of the referents and the relations between them. We adopt well-known extensions to the standard theory to include rhetorical relations (Asher, 1993) and presuppositions (Van der Sandt, 1992). DRSs are traditionally visualised as boxes, with the referents placed in the top part and the DRS conditions in the bottom part. The convention in SemBank is to sort the discourse referents into entities (variables starting with an x), events (e), propositions (p), temporalities (t), and discourse segments (k), as Figure 2 shows.

The DRS conditions can be divided into basic and complex conditions. Basic conditions describe names of discourse referents (named), concepts of entities (pred), relations between discourse referents (rel), the cardinality of discourse referents denoting sets of objects (card), or identity between discourse referents (=). Complex conditions introduce embedded DRSs: implication (⇒), negation (¬), disjunction (∨), and modalities (□, ◇). DRSs are thus recursive in nature, and the embedding of DRSs restricts the resolution of pronouns (and other anaphoric expressions), which is one of the trademark properties of DRT.
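To make the shape of these structures concrete, the following is a minimal sketch of ours (in Python; not part of SemBank) of a recursive DRS type, with conditions as plain tuples mirroring the internal encoding of Table 1. It builds the DRS for "A man does not own a car"; the sense numbers and role labels are illustrative guesses.

    from dataclasses import dataclass

    @dataclass
    class DRS:
        referents: list   # sorted referents: x.. entities, e.. events, ...
        conditions: list  # basic conditions (tuples) and complex conditions

    # Basic conditions, following the encoding of Table 1:
    #   ('pred', 'x1', 'man', 'n', 1)      concept with WordNet sense
    #   ('named', 'x1', 'Bowie', 'per')    name with ACE entity type
    #   ('rel', 'e1', 'x1', 'Agent')       VerbNet thematic role
    # Complex conditions embed DRSs:
    #   ('not', K), ('imp', K1, K2), ('or', K1, K2)

    # "A man does not own a car."
    a_man_does_not_own_a_car = DRS(
        referents=['x1'],
        conditions=[
            ('pred', 'x1', 'man', 'n', 1),
            ('not', DRS(
                referents=['x2', 'e1'],
                conditions=[
                    ('pred', 'x2', 'car', 'n', 1),
                    ('pred', 'e1', 'own', 'v', 1),
                    ('rel', 'e1', 'x1', 'Agent'),
                    ('rel', 'e1', 'x2', 'Theme'),
                ])),
        ])

Because x2 exists only inside the negated DRS, a follow-up pronoun (as in "*It is red") cannot be resolved to the car: the accessibility restriction just mentioned falls out of the structure.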

The aim of SemBank is to provide fully resolved semantic representations. Obviously, natural language expressions can be ambiguous, and picking the most likely interpretation isn't always straightforward: some pronouns have no clear antecedents, word senses are often hard to distinguish, and scope orderings are sometimes vague. In future work this might give rise to adding an underspecification mechanism to the formalism.

DRSs are formal structures and come with a model-theoretic interpretation. This interpretation can be given directly (Kamp and Reyle, 1993) or via a translation into first-order logic (Muskens, 1996). This is interesting from a practical perspective, because it permits the use of efficient existing inference engines developed by the automated deduction community. Logical inference can play a role in tasks surrounding NLG (e.g., summarisation, question answering, or textual entailment), but also in dedicated components of NLG systems, such as generating definite descriptions, which requires checking contextual restrictions (Gardent et al., 2004).
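As an illustration of this translation, here is a minimal sketch over the toy DRS type above, in the spirit of the standard translation (Kamp and Reyle, 1993; Muskens, 1996). It covers only the condition types introduced so far, and its formula syntax is ad hoc.

    def drs_to_fol(drs):
        """A DRS becomes an existential quantification over its
        referents of the conjunction of its translated conditions."""
        body = ' & '.join(cond_to_fol(c) for c in drs.conditions)
        if drs.referents:
            return 'exists ' + ' '.join(drs.referents) + '. (' + body + ')'
        return '(' + body + ')'

    def cond_to_fol(cond):
        tag = cond[0]
        if tag == 'pred':                    # pred(X, lemma, pos, sense)
            _, x, lemma, pos, sense = cond
            return '%s_%s_%s(%s)' % (lemma, pos, sense, x)
        if tag == 'named':                   # named(X, name, type)
            return 'named(%s,%s,%s)' % (cond[1], cond[2], cond[3])
        if tag == 'rel':                     # rel(E, X, role)
            return '%s(%s,%s)' % (cond[3], cond[1], cond[2])
        if tag == 'not':                     # negation embeds one DRS
            return 'not ' + drs_to_fol(cond[1])
        if tag == 'imp':                     # K1 => K2: universally quantify
            k1, k2 = cond[1], cond[2]        # over K1's referents
            ante = ' & '.join(cond_to_fol(c) for c in k1.conditions)
            quant = 'forall %s. ' % ' '.join(k1.referents) if k1.referents else ''
            return quant + '((' + ante + ') -> ' + drs_to_fol(k2) + ')'
        raise ValueError('unknown condition: %r' % (cond,))

    print(drs_to_fol(a_man_does_not_own_a_car))
    # exists x1. (man_n_1(x1) & not exists x2 e1. (car_n_1(x2) & ...))

A formula in this shape can then be handed to an off-the-shelf theorem prover, after mapping it to the prover's input syntax.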

Figure 1 illustrates how SemBank provides the compositional semantics of each sentence in the text in the form of a CCG derivation. Here each token is associated with a supertag (a lexical CCG category) and its corresponding lexical semantics, a partial DRS. The CCG derivation, a tree structure, shows the compositional semantics at each step of the derivation, with the aid of the λ-calculus (the @ operator denotes function application).

Table 1 shows how the various levels of annotation are integrated in DRSs. Thematic roles (VerbNet) are implied by the neo-Davidsonian event semantics employed in SemBank, and are represented as two-place relations. The named entity types form part of the basic DRS condition for names, and word senses (WordNet) are represented as a feature on the one-place conditions for nouns, verbs and modifiers. Rhetorical relations are already part and parcel of SDRT. Hence, SemBank provides all these different layers of information within a DRS. Figure 2 shows an SDRS for a small text from SemBank.
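As a concrete (simplified) illustration of this integration, the first sentence of the Figure 2 text might come out roughly as follows in the toy encoding above. The entity types, sense numbers and role labels are our guesses for illustration, not SemBank's actual annotation, and tense and the SDRS segment structure are omitted.

    # "David Bowie grew up in Kent."  (simplified)
    bowie_grew_up_in_kent = DRS(
        referents=['x1', 'x2', 'e1'],
        conditions=[
            ('named', 'x1', 'Bowie', 'per'),    # named entity with ACE type
            ('named', 'x2', 'Kent', 'loc'),
            ('pred', 'e1', 'grow_up', 'v', 1),  # verb concept with WordNet sense
            ('rel', 'e1', 'x1', 'Agent'),       # VerbNet thematic roles
            ('rel', 'e1', 'x2', 'Location'),
        ])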


Figure 2: SDRS for the text "David Bowie grew up in Kent. He played the saxophone. He was a singer in London blues bands", as shown in SemBank.

4 Challenges for Text Generation

We believe that taking DRSs as the basis for NLG will not only introduce variants of known problems, but will also impose many new challenges. Here we focus on just three of them: generating referring expressions, lexicalisation, and aggregation.

4.1 Generating referring expressions

Viewed from a formal perspective, DRT is said to be a dynamic theory of semantics: the interpretation of an embedded DRS depends on the interpretation of the DRSs that subordinate it, either through sentence-internal structure or through the structure governed by rhetorical relations. A case in point is the treatment of anaphoric expressions, including pronouns, proper names, possessive constructions and definite descriptions.

In DRT, anaphoric expressions are resolved to a suitable antecedent discourse referent. Proper names and definite descriptions are resolved too, but if finding a suitable antecedent fails, a process usually referred to as presuppositional accommodation introduces the semantic material of the anaphoric expression at an accessible level of DRS (Van der Sandt, 1992). This process yields a DRS in which all presupposed information is explicitly distinguished from asserted information. This gives rise to an interesting challenge for NLG.

A DRS corresponding to a discourse unit will contain free variables for semantic material that is presupposed or has been linked to the preceding context. On encountering such a free variable denoting an entity, the generator has several choices in the way it can lexicalise it: as a quantifier, pronoun, proper name, or definite description. Even though the DRS context may provide information on names and properties assigned to this free variable, we expect it will be non-trivial to decide which properties to include in the corresponding expression. Text coherence probably plays an important role here, but whether thematic roles and rhetorical relations will be sufficient to predict an appropriate surface form remains a subject for future research. It is also interesting to explore combining the insights from approaches dedicated to generating referring expressions using logical methods (van Deemter, 2006; Gardent et al., 2004) with robust surface realisation systems.
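To make the decision space explicit, here is a deliberately naive sketch of such a choice over the toy encoding above. Everything in it is hypothetical: the salience features (distance, competitors) and the thresholds are invented, and a real component would have to be tuned on, or learned from, corpus data.

    def choose_referring_expression(var, drs, context):
        """Pick a surface form for a free variable `var` in `drs`.
        `context` is assumed to provide two salience features: how many
        sentences ago the antecedent was mentioned, and which other
        accessible referents could be confused with it."""
        names = [c for c in drs.conditions if c[0] == 'named' and c[1] == var]
        descriptions = [c for c in drs.conditions if c[0] == 'pred' and c[1] == var]
        if context['distance'] <= 1 and not context['competitors']:
            return ('pronoun', None)                    # "he"
        if names:
            return ('proper_name', names[0][2])         # "Bowie"
        if descriptions:
            return ('definite_np', descriptions[0][2])  # "the singer"
        return ('indefinite_np', None)                  # last resort: reintroduce

    print(choose_referring_expression(
        'x1', bowie_grew_up_in_kent, {'distance': 3, 'competitors': []}))
    # ('proper_name', 'Bowie')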

4.2 Aggregation

Coordinated noun phrases are known to be potentially ambiguous between distributive and collective interpretations. A simple DRT analysis of the distributive interpretation yields two possible ways to generate strings: one where the noun phrases are coordinated within one sentence, and one where the noun phrases involved are generated in separate sentences. For instance, the DRSs corresponding to "Deep Purple and Pink Floyd played at a charity show" (with a distributive interpretation) and "Deep Purple played at a charity show, and Pink Floyd played at a charity show" would be equivalent. This is due to the copying of semantic material in the compositional process of computing the meaning of the coordinated noun phrase "Deep Purple and Pink Floyd". (Note that the collective reading, as in "Deep Purple and Pink Floyd played together at a charity show", would not involve copying semantic material, and would result in a different DRS, with a different interpretation.) It is the task of the aggregation process to pick one of these realisations, as discussed by White (2006). Doing this from the level of DRS poses an interesting challenge, because one would need to recognise that such an aggregation choice is possible in the first place. Alternatively, instead of copying, one could use an explicit operator that signals a distributive reading of a plural noun phrase, for instance as suggested by Kamp and Reyle (1993). Arguably, this is required anyway to adequately represent sentences such as "Both Deep Purple and Pink Floyd played at a charity show".
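One way a generator might recognise that such an aggregation choice exists, sketched minimally over the toy encoding above: two events are candidates for coordination when they carry the same predicate and identical roles except for exactly one filler. A real test would also have to compare the conditions on the fillers themselves; this only compares the event descriptions.

    def aggregatable(drs, e1, e2):
        """Return the role in which events e1 and e2 differ, if they
        share their predicate and all other role fillers; else None.
        A single differing role signals that a coordinated noun phrase
        (distributive reading) is one possible realisation."""
        def profile(e):
            preds = {c[2:] for c in drs.conditions
                     if c[0] == 'pred' and c[1] == e}
            roles = {c[3]: c[2] for c in drs.conditions
                     if c[0] == 'rel' and c[1] == e}
            return preds, roles
        preds1, roles1 = profile(e1)
        preds2, roles2 = profile(e2)
        if preds1 != preds2 or roles1.keys() != roles2.keys():
            return None
        differing = [r for r in roles1 if roles1[r] != roles2[r]]
        return differing[0] if len(differing) == 1 else None

For the distributive "Deep Purple ... and Pink Floyd ..." example, the two playing events would differ only in their Agent, so the aggregator is free to coordinate the two band names within one sentence.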


4.3 Lexicalisation

The predicates (the one-place relations) found in a DRS correspond to concepts of a hierarchical ontology. Time expressions and numerals have canonical representations in SemBank. The representations for noun and verb concepts are based on the synonym sets provided by WordNet (Fellbaum, 1998). A WordNet synset can be referred to by its internal identifier, or by any of its member word-sense pairs. For instance, synset 102999757 is composed of the noun-sense pairs strand-3, string-10, and chain-10. The lexicalisation challenge is to select the most suitable word out of these possibilities. Local context might help to choose: a "string of beads" is perhaps better than a "chain of beads" or a "strand of beads". As another example, consider the synset {loon-2, diver-3}, representing the concept for a kind of bird. American birdwatchers would use the noun "loon", whereas in Britain "diver" would be preferred to name this bird.
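A minimal sketch of such a context-sensitive choice, with collocation counts that are entirely invented for the sake of the example (a real system might estimate them from a large background corpus):

    # Hypothetical collocation counts; in practice these would be
    # estimated from a background corpus.
    COLLOCATIONS = {
        ('string', 'of beads'): 412,
        ('strand', 'of beads'): 118,
        ('chain', 'of beads'): 31,
    }

    def lexicalise(synset_members, local_context):
        """Choose the synset member that best fits the local context,
        falling back to the first member when nothing is attested."""
        scored = [(COLLOCATIONS.get((w, local_context), 0), w)
                  for w in synset_members]
        best_count, best_word = max(scored)
        return best_word if best_count > 0 else synset_members[0]

    print(lexicalise(['strand', 'string', 'chain'], 'of beads'))  # string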

5 Discussion

In the DRT-based framework that we propose for generating text, the question arises where in the traditional NLG pipeline DRSs play a role. In the introduction of this paper we suggested that DRSs would be output by the macro planner, and hence fed as input to the micro planner. On the one hand this makes sense, as in a segmented DRS all content to be generated is present and the rhetorical structure is given explicitly. But then the question remains whether the theoretical distinction between micro planning and surface realisation really works in practice or would just be counter-productive. Perhaps a revised architecture tailored to DRS generation should be tried instead. This issue is closely connected to the level of semantic granularity that one would like to see in a DRS. We illustrate this with four examples:

• pronouns — we have made a particular proposal using free variables, but we could also have followed Kamp and Reyle (1993), introducing explicit referents for pronouns;

• distributive noun phrases — as the discussion in Section 4.2 shows, it is unclear which representation for distributive noun phrases would be most suitable for the purpose of sentence planning;

• sentential and verbal complements — should there be a difference in meaning representation between "Tim expects to win" and "Tim expects that he will win"?

• active vs. passive voice — should a meaning representation reflect the difference between active and passive sentences?

At this moment it is not clear whether one wants a more abstract DRS, giving more freedom to sentence planning, or a more specific DRS, restricting the number of sentential paraphrases of its content. Perhaps even an architecture permitting both extremes would be feasible, where the task of micro planning would be to add more constraints to the DRS until it is specific enough for the surface realisation component to generate text from it. It is even thinkable that such a planning component would take over some tasks of the macro planner, making the distinction between the two fuzzier.

A final point that we want to raise is the possible role that inference can play in our framework. DRSs can be structurally different, yet logically equivalent. This could influence the design of a generation system and have a positive impact on its output. For instance, it would be thinkable to equip the NLG system with a set of meaning-preserving transformation rules that change the structure of a DRS, consequently producing different surface forms.
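As a minimal sketch of what one such rule could look like over the toy encoding above, consider double negation elimination. It preserves truth conditions under the first-order translation, although in dynamic terms it changes which referents are accessible, so in DRT such rules need care.

    def eliminate_double_negation(drs):
        """Rewrite not(not(K)) to K, recursively, merging K's referents
        and conditions into the surrounding DRS. Input and output are
        logically equivalent under the first-order translation, but may
        lead a realiser to different surface forms."""
        refs, conds = list(drs.referents), []
        for c in drs.conditions:
            if (c[0] == 'not' and not c[1].referents
                    and len(c[1].conditions) == 1
                    and c[1].conditions[0][0] == 'not'):
                inner = eliminate_double_negation(c[1].conditions[0][1])
                refs += inner.referents       # promote the inner referents
                conds += inner.conditions
            elif c[0] == 'not':
                conds.append(('not', eliminate_double_negation(c[1])))
            else:
                conds.append(c)
        return DRS(refs, conds)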

6 Conclusion

SemBank provides an annotated corpus combining shallow with formal semantic representations for texts. The development version is currently available online with more than 60,000 automatically annotated texts; the release of a first stable version comprising ca. 1,000 texts is planned later this year. We expect SemBank to be a useful resource for making progress in robust NLG. Using DRSs as a basis for generation poses new challenges, but could also offer fresh perspectives on existing problems in NLG.

Acknowledgments

We are grateful to Michael White, who provided us with useful feedback on the idea of using a DRS corpus for developing and training text generation systems. We would also like to thank the three anonymous reviewers of this article, whose extremely valuable comments considerably improved our paper.


References

Guillaume Artignan, Mountaz Hascoet, and Mathieu Lafourcade. 2009. Multiscale visual analysis of lexical networks. In Proceedings of the International Conference on Information Visualisation, pages 685–690.

N. Asher and A. Lascarides. 2003. Logics of Conversation. Studies in Natural Language Processing. Cambridge University Press.

Nicholas Asher. 1993. Reference to Abstract Objects in Discourse. Kluwer Academic Publishers.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics: Proceedings of the Conference, pages 86–90, Universite de Montreal, Montreal, Quebec, Canada.

John Bateman and Michael Zock. 2003. Natural Language Generation. In R. Mitkov, editor, Oxford Handbook of Computational Linguistics, chapter 15, pages 284–304. Oxford University Press, Oxford.

Bernd Bohnet, Leo Wanner, Simon Mille, and Alicia Burga. 2010. Broad coverage multilingual deep sentence generation with a stochastic multi-level realizer. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 98–106.

Johan Bos. 2008. Wide-Coverage Semantic Analysis with Boxer. In J. Bos and R. Delmonte, editors, Semantics in Text Processing: STEP 2008 Conference Proceedings, volume 1 of Research in Computational Semantics, pages 277–286. College Publications.

John Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2008. Addressing the Resource Bottleneck to Create Large-Scale Annotated Texts. In Johan Bos and Rodolfo Delmonte, editors, Semantics in Text Processing: STEP 2008 Conference Proceedings, volume 1 of Research in Computational Semantics, pages 375–380. College Publications.

Stephen Clark and James R. Curran. 2004. Parsing the WSJ using CCG and Log-Linear Models. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), pages 104–111, Barcelona, Spain.

Michael Elhadad and Jacques Robin. 1996. An overview of SURGE: a reusable comprehensive syntactic realization component. Technical report.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. The MIT Press.

Claire Gardent, Helene Manuelian, Kristina Striegnitz, and Marilisa Amoia. 2004. Generating definite descriptions: Non-incrementality, inference and data. In Thomas Pechmann and Christopher Habel, editors, Multidisciplinary Approaches to Language Production, pages 53–85. Walter de Gruyter, Berlin.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60, Stroudsburg, PA, USA.

Nancy Ide, Christiane Fellbaum, Collin Baker, and Rebecca Passonneau. 2010. The manually annotated sub-corpus: a community resource for and by the people. In Proceedings of the ACL 2010 Conference Short Papers, pages 68–73, Stroudsburg, PA, USA.

Hans Kamp and Uwe Reyle. 1993. From Discourse to Logic: An Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and DRT. Kluwer, Dordrecht.

Hans Kamp. 1984. A Theory of Truth and Semantic Representation. In Jeroen Groenendijk, Theo M.V. Janssen, and Martin Stokhof, editors, Truth, Interpretation and Information, pages 1–41. FORIS, Dordrecht, Holland / Cinnaminson, U.S.A.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale classification of English verbs. Language Resources and Evaluation, 42(1):21–40.

Reinhard Muskens. 1996. Combining Montague Semantics and Discourse Representation. Linguistics and Philosophy, 19:143–186.

Martha Palmer, Paul Kingsbury, and Daniel Gildea. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Rashmi Prasad, Aravind Joshi, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, and Bonnie Webber. 2005. The Penn Discourse TreeBank as a resource for natural language generation. In Proceedings of the Corpus Linguistics Workshop on Using Corpora for Natural Language Generation, pages 25–32.

Mark Steedman. 2001. The Syntactic Process. The MIT Press.

Kees van Deemter. 2006. Generating referring expressions that involve gradable properties. Computational Linguistics, 32(2):195–222.

Rob A. Van der Sandt. 1992. Presupposition Projection as Anaphora Resolution. Journal of Semantics, 9:333–377.

Michael White, Rajakrishnan Rajkumar, and Scott Martin. 2007. Towards broad coverage surface realization with CCG. In Proceedings of the Workshop on Using Corpora for NLG: Language Generation and Machine Translation (UCNLG+MT).

Michael White. 2006. Efficient realization of coordinate structures in combinatory categorial grammar. Research on Language & Computation, 4:39–75.