
Qall-Me – Qall-Me Framework and Textual Entailment Based Question Answering

Author: Qall-Me Consortium

Affiliation: FBK, University of Wolverhampton, University of Alicante, DFKI, Comdata, Ubiest, Waycom

Keywords: question answering, textual entailment, Qall-Me framework, EDITS system

Abstract: This report presents the main ideas underlying the Textual Entailment approach for Question Answering developed in the context of the Qall-Me project. First we present the general architecture of the system (the Qall-Me framework) and then we focus on the Textual Entailment component (i.e. the EDITS platform).

FP6 IST-033860

http://qallme.itc.it


D4.3

1. Introduction
2. QALL-ME Framework Overview
   a. Framework Architecture
      i. Question Answering Workflow
      ii. System Components
   b. Comparison to the QALL-ME Prototypes
   c. Implementing with the Framework
3. Applications of the QALL-ME Framework
   a. Integration into the QALL-ME Showcase
4. The EDITS Textual Entailment System
   4.1. EDITS: a general-purpose RTE package
      4.1.1. Configuration and I/O formats
      4.1.2. Language-dependent/independent components
      Usage documentation
   4.2. Evaluation of the EDITS package
      4.2.1. Experiment 1: Linear Distance vs. Tree Edit Distance
      4.2.2. Experiment 2: dealing with Italian spoken requests
      4.2.3. Experiment 3: on-field evaluation
   4.3. Automatic acquisition of costs for edit operations
   4.4. Open-source release of the EDITS package
References


1. Introduction

The QALL-ME Framework provides a reusable architecture for multilingual Question Answering (QA) systems over structured data belonging to a certain domain. The target domain is modeled by an ontology, which is used as the main resource to enable multilinguality. The framework takes up-to-date information into consideration by using temporal and spatial reasoning when processing an inquiry. Finally, it determines the mapping between questions and answers using a novel approach based on textual entailment recognizers. The framework is built on a Service Oriented Architecture (SOA), which is realized using web services. The main characteristics of the QALL-ME Framework are the following:

Multilinguality. The QALL-ME Framework supports questions and answers in several different natural languages. Answers can be retrieved cross-lingually, that is, the answers can be in another language than the language of the question. This capability is addressed by a language identification module, which detects the language of the question, and by the spatial context of the inquiry, which is used to infer the language of the answers. Currently, the framework comes with demo components for German, Spanish and English questions; however, its design permits easy extension by adding the necessary language-dependent components for a new language. For example, in the QALL-ME project we have also implemented an Italian language subsystem.

Data Sources. The data sources used to retrieve the answers are usually databases or simply XML documents with a specific structure. In our framework, the data structures used have to be accessible via predefined RDF interfaces. These interfaces or RDF schemas are specified by the domain ontology that is used. This implies that the answer data sources are always bound to a certain domain, which, however, can be freely specified by exchanging or extending the domain ontology.

Ontology. To use the QALL-ME Framework, an ontology containing concept descriptions for the target domain and descriptions of possible relations between these concepts has to be provided. The ontology is then used in two ways: (i) as a schema for representing the structured answer data described as instances of the ontology concepts; and (ii) to cross the language barrier in multilingual QA. By describing the answer data with the ontology vocabulary, we obtain a representation of the data which is independent of the original language of the data. Therefore, by mapping the original question to a query that uses the ontology vocabulary, we can apply that query to the answer data and cross the language barrier.

Context Awareness. By default all questions are anchored in space and time, meaning that every question always has a spatial and temporal context. For instance, owing to deictic expressions such as "here" or "tomorrow", a question posed at eight o'clock in Berlin may mean something completely different from the same question posed at five o'clock in Amsterdam.

Textual Entailment. Behind the scenes, the framework addresses the mapping between a natural language question and a database query using Recognizing Textual Entailment (RTE) techniques. RTE techniques allow us to deal with the language


variability expressed within the questions through semantic inferences at the textual level. RTE is defined as a generic framework for modeling semantic implications where a concrete meaning is described in different ways. An RTE component can recognize whether some text T entails a hypothesis H, i.e., whether the meaning of H can be fully derived from the meaning of T. In the context of our framework, H is the minimal form of a question about some topic and T is the input question which has to be answered. The set of minimal questions has to be defined in advance, i.e., each minimal question that shall be answerable has to be in the set. Through RTE we can then handle all reformulations of these minimal questions by semantically associating a new input question with one of the predefined minimal questions. In order to make the RTE inference more general, both the minimal question and the input question are converted into patterns by replacing certain ontology entities with suitable placeholders. To complete the retrieval procedure, we simply attach to each minimal question the corresponding database query that returns the correct answer(s). The entailment component will then determine which minimal question is entailed by the input question, and the associated database query will provide the answer(s).

Service Oriented Architecture. It is worth mentioning that the framework is based on a SOA. Such an architecture enables distributed computing and enforces loose coupling of system components, the so-called services. Every service can be seen as a "black box" with a specialized functionality, which is accessible only via standardized interfaces. An orchestration of these services creates a business process. In our specific case, the business process is a QA system and the service components are realized as web service implementations such as entity annotators, query generators or textual entailment recognizers. Orchestrations of different web service implementations create different QA systems, and existing web service implementations can easily be reused in different systems.
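To make this retrieval procedure concrete, the following minimal sketch shows how an input question could be matched against predefined minimal questions via an RTE component; all names here (MinimalQuestion, rte.entails, kb.run_sparql, the threshold) are hypothetical illustrations, not the framework's actual API.

    # Hypothetical sketch (not the framework's actual API): map an input
    # question to a predefined minimal question via an RTE component, then
    # run the SPARQL query attached to that minimal question.

    from dataclasses import dataclass

    @dataclass
    class MinimalQuestion:
        pattern: str   # e.g. "where can i see [MOVIE]"
        sparql: str    # database query attached to this minimal question

    def answer(question, minimal_questions, rte, kb, threshold=0.5):
        """Return answers for `question`, or [] for out-of-domain input."""
        best, best_score = None, threshold
        for mq in minimal_questions:
            # rte.entails(T, H): confidence that the input question T
            # entails the minimal question H (both converted to patterns)
            score = rte.entails(question, mq.pattern)
            if score > best_score:
                best, best_score = mq, score
        if best is None:
            return []                      # no minimal question entailed
        return kb.run_sparql(best.sparql)  # retrieve the answer(s)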

2. QALL-ME Framework Overview

The general concepts that make up the QALL-ME Question Answering Framework are the following:

• cross-lingual QA; in the QALL-ME project itself four languages were supported: English, German, Italian and Spanish

• use of structured answer data sources, typically represented in the open RDF format (Prud’hommeaux and Seaborne, 2008)

• use of domain ontologies which describe each QA system’s domain; for the QALL-ME project a tourism domain specific ontology has been created (Ou et al., 2007; Ou, 2008)

• support for the spatiotemporal anchoring of questions

• use of the recognition of Textual Entailment (RTE) for QA (Negri, 2007)

• use of a Service Oriented Architecture (SOA) which is realized with web service (WS) technology (Sacaleanu and Spurk, 2007; Spurk and Neumann, 2008)

In this section we provide an overview of the framework. We start with a description of the framework architecture in Section a. Next, in Section b, we summarize how the


framework compares to the QALL-ME project prototypes. Section c concludes this section with a presentation of how QALL-ME Framework adopters are intended to use the framework and what kind of software, documentation and support is provided for them.

a. Framework Architecture

The QALL-ME Framework is implemented on the basis of a SOA with web service (WS) technology. In this section we give a conceptual overview of the QALL-ME Framework’s system architecture. We start with a description of the workflow between the QA components of the system in Section i before we have a closer look at the components themselves in Section ii.

i. Question Answering Workflow

The QALL-ME Framework QA system workflow is managed by the QA planner. Figure 1 gives a conceptual view of this workflow in the planner. As can be seen, the workflow is fairly linear: starting from the reception of the input question with its context, the planner processes the inquiry until it has an answer.

The QA planner has five input parameters, three of which are given in the user's inquiry: the asked question as well as a spatiotemporal context in the form of the current time and the user's location. The two other input parameters can be considered more or less static, as they usually do not change with the inquiry: a collection of facts that is inserted into the answers database, and a collection of pattern mappings. The latter is entered into a static pattern store which is used in the Query Generation component. The collection of pattern mappings is either created manually – as has been done in the QALL-ME project up to the second prototype system – or a suitable learning subsystem has to be developed which creates the required mappings automatically1. The single output parameter of the QA planner is the found answer.

The workflow depicted in Figure 1 represents both the monolingual QA workflow and the cross-lingual QA workflow. Depending on the language of the input question and the user's location, the QA planner selects appropriate component implementations on the way to the answer. There are three kinds of components (see the sketch after the following list):

• language-specific: the component is implemented once for each language that the QA system shall be able to handle for input questions; the QA planner has to choose from these implementations

• location-specific: the component is implemented once for each location at which the QA system shall be able to answer questions; this is in line with the spatial anchoring of questions; the QA planner has to choose from these implementations

• system-wide: the component is implemented only once for the whole QA system, i.e., it has language and location independent functionality and the QA planner uses this implementation for all input questions
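As a rough illustration of this selection logic, the following sketch shows how a planner might pick component implementations by language and location; the class and method names are hypothetical, since the real framework wires components together as distributed web services.

    # Hypothetical sketch of the QA planner's component selection; in the
    # real framework the components are web services behind WSDL interfaces.

    class ComponentRegistry:
        def __init__(self):
            self.language_specific = {}  # (component, language) -> impl
            self.location_specific = {}  # (component, location) -> impl
            self.system_wide = {}        # component -> impl

        def select(self, component, language=None, location=None):
            """Pick the implementation matching the inquiry's context."""
            if (component, language) in self.language_specific:
                return self.language_specific[(component, language)]
            if (component, location) in self.location_specific:
                return self.location_specific[(component, location)]
            return self.system_wide[component]  # context-independent

    # e.g.: registry.select("TermAnnotation", language=qobj.language)
    #       registry.select("EntityAnnotation", location=qobj.location)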

1 Such automatic pattern mapping learning has been a research topic in the third development cycle and is described in full detail in the QALL-ME project deliverable D3.3.


Figure 1: Conceptual view of the QA workflow in the QA planner as a UML 2 activity diagram.

In Figure 1, all language- and location-specific components are annotated with a corresponding UML 2 comment; all other components are system-wide components whose implementations are provided by the core framework. In the following section the components in the depicted QA workflow are presented in more detail.

ii. System Components

In this section we present a description of the components that appear in the conceptual QA workflow of the QALL-ME Framework as depicted in Figure 1.

As soon as a user asks a question – no matter through which UI – the QA planner receives the user's inquiry, consisting of the question and the question's context (cf. Receive Question and Context component). Here the inquiry is put into a question object QObj on which the planner works in the following steps. A first addition to the QObj is the language of the inquiry, which is automatically recognized by a Language Identification component.

Depending on the location of the inquiry, the QA planner then selects an appropriate Entity Annotation component to annotate entity names in the question which depend on the available answer data. Such entity names might be cinema names or


names of movies which can be seen in local cinemas. Thus, the types of entities that have to be annotated here depend on the domain of the QA system being built.

A second annotation component follows in the QA planner's workflow: a Term Annotation component is selected according to the question language and annotates language-specific terms that are relevant for the QA system's domain. The difference from the previously employed Entity Annotation component is that the latter usually annotates only named entities, which are more or less language independent. The term annotator, however, annotates terms that are used to express concepts only in certain languages. Consider for example hotel facilities: terms for concepts like TV, swimming pool, safe, etc. are represented differently in different languages, e.g., an English "hair dryer" is equivalent to the German "Fön". Location-specific (named) entities like "Berlin" or "New York" mostly stay the same even when they are used in different languages; that is why there is a separate Entity Annotation component.

To complete the annotation of the input question, the QA planner selects an appropriate Temporal Expression Annotation component which is used to annotate temporal expressions in the input. Such temporal expressions are similar to the previously mentioned terms in that they are highly language specific. For example, the English "yesterday" and the German "gestern" refer to the same time when used in the same context. One last thing should be noted regarding the three annotation components: they not only annotate but also find a canonical, language-independent or normalized form of each annotated entity, term or expression. A language-dependent temporal expression, for example, is normalized to a common format like TIMEX2 or ISO 8601. With these canonical forms it is a lot easier for the following Query Generation component to translate that expression into an answer search query.

After the annotations, the QA planner is ready to generate a query for finding an answer to the input question. It chooses a suitable Query Generation component for the question language and – using a component for the Recognition of Textual Entailment (RTE) – generates a SPARQL query that is applicable to the available answer data. As detailed in Negri (2007) and other QALL-ME project documents, this step is one of the most important of the whole workflow, because it builds the actual bridge between information from the input question and potential answer data. If no query could be generated, then the user has probably posed an out-of-domain question and an empty answer is returned (cf. Create Empty AObj component). Otherwise the SPARQL query is fed into an Answer Retrieval component. A suitable component implementation for this is again selected by the QA planner according to the spatial context of the question. The answer retrieval component is connected to a location-dependent RDF fact database from which it retrieves the answers that the QA planner eventually returns.
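As a toy illustration of this normalization step (the framework's annotators are web services; the function and the lookup table below are purely hypothetical):

    # Toy sketch: normalizing deictic temporal expressions to ISO 8601,
    # relative to the temporal context of the inquiry.

    from datetime import date, timedelta
    from typing import Optional

    # surface form -> offset in days (a tiny stand-in for a real annotator)
    DEICTIC = {
        "yesterday": -1, "today": 0, "tomorrow": 1,   # English
        "gestern": -1, "heute": 0, "morgen": 1,       # German
    }

    def normalize_temporal(expression: str, context_date: date) -> Optional[str]:
        """Return the ISO 8601 date denoted by `expression`, if known."""
        offset = DEICTIC.get(expression.lower())
        if offset is None:
            return None
        return (context_date + timedelta(days=offset)).isoformat()

    # normalize_temporal("gestern", date(2009, 9, 24)) -> "2009-09-23"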

b. Comparison to the QALL-ME Prototypes

Although the QALL-ME Framework stems from the QALL-ME prototype systems, it still differs from them in many ways. The most important difference is certainly that the QALL-ME Framework is not a complete QA application – and not even a complete QA system. The QALL-ME prototype systems are QA applications in that they embed a cross-lingual QA system in a complete application scenario. For example, there are different

2 TIMEX2: http://timex2.mitre.org/annotation_guidelines/2005_timex2_standard_v1.1.pdf


input and output modalities, and the applications have a certain value from an end-user perspective: the applications answer natural language questions in a restricted scenario without any need for the user to know anything about the internals of the applications; the prototype systems can simply be used right away. The QALL-ME Framework, by contrast, is not targeted at end-users at all. There is neither a user interface nor any fixed scenario which could guide the user in any way.

In addition, the QALL-ME prototype systems contain complete cross-lingual QA systems. These systems would also work as stand-alone programs; however, from an end-user perspective they would probably not be usable, as they provide just the raw QA functionality without any pleasant user interface for entering questions and presenting answers. Yet, the systems are fully functional for a certain domain. The QALL-ME Framework is not a QA system but only the skeleton of such a system. As such it is not executable right from the start, but it is also not tied to a certain domain.3

Thus, the QALL-ME Framework is just an architecture skeleton for cross-lingual QA systems. The created QA systems can answer questions with the help of structured answer data sources from freely specifiable domains. The QALL-ME prototype QA systems can be seen as instantiations of the skeleton that the QALL-ME Framework provides. In this spirit, the QALL-ME Framework website features the QA system of the second QALL-ME prototype as an online demo.4

c. Implementing with the Framework

There are several approaches that can be used to start developing with the QALL-ME Framework. As most people will not be familiar with QALL-ME technology, it is probably best for them to follow one of the provided tutorials to get a first impression of what working with the QALL-ME Framework looks like. We are providing two kinds of software packages for download:

• The QALL-ME Framework SDK (Software Development Kit) is the major package of the QALL-ME Framework. It is mandatory for people who wish to develop their own QA system with the framework. The SDK contains the source code of the system-wide components of the framework as well as all WSDL interface definitions of the framework's web service components. Furthermore, the package contains dependency libraries for the framework's system-wide component implementations. Finally, the releases also contain the full documentation from the website for offline reading.

• The DemoKit is an optional, yet very helpful package which contains demo implementations of the QALL-ME Framework components. This encompasses pre-compiled WAR files5 for all system-wide framework components and demo implementations for QA subsystems for three different languages and locations (England, Germany and Spain with their corresponding official languages). The DemoKit is not only intended for users who are just trying out the QALL-ME Framework but it also contains the sources of the demo component implementations which may greatly help developers to get started using the framework for their own QA systems.

3 Notwithstanding that, the QALL-ME Framework comes with a collection of demo component implementations which provide for a complete demo QA system that an interested developer can easily and quickly get running. See Section 2.c for more information.
4 The online demo can be found at http://qallme.sourceforge.net/onlinedemo.html.
5 WAR (Web ARchive) files can be directly deployed on any Java EE 5 compliant application server without further work, i.e., the packaged web application – web services in our case – can easily be tried out.


So most developers will probably first have a look at the DemoKit package and the "Getting Started" tutorial on the website. If they are determined to build their own QA system on the QALL-ME Framework, they will then have to examine the SDK package. In addition, they may find it useful to follow further, more in-depth tutorials on the website. These tutorials are part of the thorough documentation which comes with the framework and which additionally provides overviews of relevant QALL-ME technologies and a detailed technical manual for the framework components.

If there are questions about the documentation, or if a QALL-ME Framework adopter has more specific questions than the current documentation answers, the project mailing list can be used. Most of the core developers of the framework are subscribed to this list and will provide help and support there.

People who are not only developing with the QALL-ME Framework but who would also like to develop the framework itself further are kindly welcomed and supported as well. In the project documentation we provide an extra chapter on possible extensions of the QALL-ME Framework where interested developers can pick a task of their choice for enhancing the framework as a whole. Additionally, they can peruse the feature requests in the project's ticket system and attach patches which provide the requested functionality.

3. Applications of the QALL-ME Framework

Despite its rather young age, the QALL-ME Framework has already been used in several (prototype) applications and research experiments. In this section we briefly present the QALL-ME Showcase, a prototypical mobile device application developed in the QALL-ME project for answering context-sensitive questions with the help of a question answering (QA) system instance that is based on the QALL-ME Framework.

a. Integration into the QALL-ME Showcase

The QALL-ME Showcase is a prototypical mobile device application which shows the achievements of the QALL-ME project as a whole from an end-user perspective. Among other things, the application provides a mobile interface to the QALL-ME QA system, which is based on the QALL-ME Framework.

The communication between the QALL-ME mobile client and its services has been implemented by defining and implementing a communication protocol based on XML syntax; the protocol provides for natural language user queries. The XML data can, in principle, contain contextual information such as the current GPS position of the user, the current kind of usage of the mobile application (in a car or for pedestrian navigation) and the user's preferred answer language (English or Spanish).

Our client solution consists of a library that exposes an interface to the graphical presentation. This library allows the application to perform the following main tasks: i) connect to the server; ii) run speech recognition and speech synthesis sessions; iii) start query sessions to the QALL-ME Framework. The communication protocol provides for the exchange of an XML structure that contains the client request. Example request:

<?xml version="1.0" encoding="us-ascii"?>


<VPIN xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <CUST ID="4924.1.ot4Ig9" TEL="10001" />
  <VER ID="MB3.0.0" />
  <AMM DMAX="10000" X="0.0" Y="0.0" PMAX="20" />
  <POI LANG="ITA" DESC="What is the plot of the film, the big dream?" MODALITY="VOICE" />
  <SENDER />
</VPIN>

The client receives a response XML structure, which lists the records for the answers:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<VPOUT RCOD="0000">
  <POIINF LANG="ITA">
    <POI DESC_TTS=":: Nicola young hopeful man coming from Puglia and moved to Rome in 1966 to attend the Academy of Dramatic Art…"
         DESC="Nicola young hopeful man coming from Puglia and moved to Rome in 1966 to attend the Academy of Dramatic Art…" />
  </POIINF>
  <EL />
</VPOUT>

In the response structure, two different fields of data presentation are provided: "DESC_TTS" contains the text of the response that can be read by the voice synthesizer; "DESC" contains the text to be shown on the display of the client device. This differentiation has been made in order to be able to insert special tags into the output which are interpreted by speech synthesizers and which may not be visible in the graphical presentation on the mobile device. These special tags make it possible, for example, to manage the prosody of the delivery or to indicate the language in which certain words contained in a sentence should be pronounced.


Figure 2. Architecture of the QALL-ME Showcase prototype.

During the last year we have developed a new version of the showcase architecture in which the conversation between the application on the client and the QALL-ME Framework service is mediated by a new component on the server side. This component was realized as a proxy that receives requests from the client. The client sends the server the text of the query (as the outcome of a previous speech recognition session, or text written by hand), the language and its GPS coordinates if available. The server connects to the QALL-ME Framework service and forwards the responses to the client in pass-through mode (cf. Figure 2).

The latest version of the client library, which is integrated in the user application, provides a list of methods and corresponding events specifying the outcome and the results of the user's query. For example, the execution of a query session towards the QALL-ME Framework is activated by the client through the executeQuery method:

executeQuery('dove fanno cosmonauta?', GPS_coordinates)

The result of this query might look like this:

onQueryResult('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <VPOUT RCOD="0000"><POIINF LANG="ITA">
    <POI Y="46.075587" X="11.119152" TEL="0461/829002" NUM="" NAME="Cinema Astra"
         ID="Cinema_-1038660544" ICON="7832" GRP="7832" DIST=""
         DESC_TTS="Cinema Astra 23 Settembre 2009 19.30 corso Buonarroti 14"
         DESC="Cinema Astra 23 Settembre 2009 19.30 corso Buonarroti 14"
         AMD4="Trento" AMC1="ITA" ADDR="corso Buonarroti 14"/>
  </POIINF><EL/></VPOUT>')

The graphical representation of the client–server communication for the given example question is depicted in the screenshots in Figure 3.


Figure 3: Client–server communication from the user's point of view for the question "dove fanno cosmonauta?" ("Where can I see the movie Cosmonaut?"): (1) the connection to the server is being established; (2) the retrieved answer is presented.

4. The EDITS Textual Entailment System

According to the Textual Entailment (TE) framework adopted in the QALL-ME project, the Question Answering problem is recast as an entailment recognition problem, where the text (T) is the question, and the hypothesis (H) is represented by textual material (either a full question or a single relation pattern) associated with instructions (SPARQL queries to a database) for retrieving the answer to the input question.

The general implementation of this architecture was driven by the project's aim to "develop a shared infrastructure for multilingual QA based on a Web Services architecture". To fit this requirement, each Answer Extraction component was designed to be pluggable into a distributed architecture. Given an input question q, the QALL-ME central QA planner interacts with four separate language-specific Answer Extractors on the basis of shared procedures and I/O formats. More specifically, once the central planner has processed q in its source language, the search for the answer is routed to the appropriate entailment-based Answer Extractor. While the modalities of interaction with the planner are shared, each Answer Extractor implements its own entailment-recognition strategy.

4.1. EDITS: a general-purpose RTE package


The development of a general-purpose RTE package has two main objectives. The first one is to provide all the partners with a shared RTE component for comparative evaluation with their own systems and data collections. The second objective is to promote the QALL-ME activities in the NLP community by making a flexible entailment package freely available for research purposes. To achieve these objectives, the RTE package should feature:

• Easy download and installation;
• Well-defined I/O formats and configuration files;
• A clear distinction between the core language-independent components (which can run on all the data stored in the QALL-ME benchmark) and additional language-specific resources, which should be easily pluggable to boost performance on a specific language;
• Complete and updated usage documentation.

Keeping the aforementioned requirements in mind, FBK-irst developed EDITS (Edit Distance Textual Entailment Suite), a general-purpose RTE package. The system implements a distance-based approach for recognizing textual entailment: given a Text-Hypothesis (T-H) pair, the output consists of one of two judgments, either TRUE or FALSE, supported by a confidence score. The output is obtained by applying edit distance algorithms, i.e. by computing the distance between T and H in terms of the cost of the edit operations (insertion, deletion and substitution) that are necessary to transform T into H. The edit distance approach adopted in EDITS builds on three core components:

• An edit distance algorithm that defines which portions of T and H are considered for edit operations, and in which order;
• A cost schema for the three edit operations, which allows the algorithm to optimize the overall cost of transforming T into H;
• A set of entailment rules providing specific knowledge (e.g. lexical, syntactic) that complements the cost schema for the edit operations.

EDITS allows the user to configure the three components, resulting in a flexible and modular approach. Given a certain configuration of its basic modules, EDITS can be trained over a specific RTE dataset in order to optimize its performance. In the training phase EDITS estimates the distance threshold S, 0<S<1, that best separates positive and negative examples in the training data. At the test stage, EDITS applies the threshold S to decide whether an entailment relation holds between a given T-H pair: pairs resulting in a distance above S are classified as FALSE, while pairs below S are classified as TRUE.

Given the edit distance ED(T,H) for a T-H pair, a normalized entailment score is produced using the following function:

Entailment(T,H) = ED(T,H) / (ED(T,_) + ED(_,H))

where ED(T,H) is the function that calculates the edit distance between T and H, and ED(T,_) + ED(_,H) is the distance equivalent to the cost of inserting the entire text


of H and deleting the entire text of T. The entailment score function therefore ranges from 0 (when T is identical to H) to 1 (when T is completely different from H).
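A minimal sketch of this scoring scheme, using a token-level edit distance with simple default costs (EDITS itself makes the algorithm, the cost schema and the entailment rules configurable; all names below are illustrative):

    # Minimal illustration of the normalized entailment score; unit costs
    # stand in for EDITS' configurable cost schemas.

    def edit_distance(t, h, ins=lambda w: 1.0, dele=lambda w: 1.0,
                      sub=lambda a, b: 0.0 if a == b else 2.0):
        """Token-level edit distance for transforming T into H."""
        m, n = len(t), len(h)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + dele(t[i - 1])   # delete a word from T
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + ins(h[j - 1])    # insert a word from H
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + dele(t[i - 1]),
                              d[i][j - 1] + ins(h[j - 1]),
                              d[i - 1][j - 1] + sub(t[i - 1], h[j - 1]))
        return d[m][n]

    def entailment_score(t_tokens, h_tokens):
        """Entailment(T,H) = ED(T,H) / (ED(T,_) + ED(_,H)); 0 = identical."""
        ed = edit_distance(t_tokens, h_tokens)
        worst = edit_distance(t_tokens, []) + edit_distance([], h_tokens)
        return ed / worst if worst else 0.0

    # A pair is judged TRUE when its score falls below the trained threshold S:
    # entailment_score("where can i see the movie shrek".split(),
    #                  "where can i see [MOVIE]".split()) < S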

4.1.1. Configuration and I/O formats

EDITS can be configured using a simple XML configuration file (the corresponding schema is available in share/conf.xsd). The simplest way to get started with EDITS is to choose a pre-defined configuration among those available in the examples directory (these configurations represent a variety of simple settings, which implement basic approaches to RTE). A configuration file contains two sections:

• A list of naming declarations for the modules that are used inside the configuration file. A declaration, for example, can be a path to data files.
• A list of configuration attributes for each module used by the package, including the parameters of the modules and their dependencies on other modules.

Once EDITS has been configured, it can be run over a training RTE dataset using the bin/train command. An RTE dataset is a set of T-H pairs expressed in the same XML format defined in the RTE evaluation campaigns, where each pair is annotated with an entailment judgment (i.e. TRUE/FALSE). Once a model file is available, EDITS can be run over a test RTE file (in the same XML format used to represent the training data) using the bin/test command.
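For illustration, a training pair in this format might look as follows (the values are invented, and attribute names may vary slightly across RTE campaign editions):

<pair id="42" entailment="TRUE">
  <t>Where can I see the movie Shrek tonight in Trento?</t>
  <h>where can i see [MOVIE]</h>
</pair>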

4.1.2. Language-dependent/independent components

As stated before, the EDITS package builds on three core components, namely:

• An edit distance algorithm that defines which portions of T and H are considered for edit operations, and in which order;
• A cost schema for the three edit operations, which allows the algorithm to optimize the overall cost of transforming T into H;
• A set of entailment rules providing specific knowledge (e.g. lexical, syntactic) that complements the cost schema for the edit operations.

The system allows for both language-independent and language-specific usage modalities, which depend on how the three components are configured (e.g. on the type of input they handle: tokens, lemmas, parts of speech, syntactic trees).

As far as the edit distance algorithms are concerned, EDITS implements two basic algorithms, namely Linear Distance and Tree Edit Distance. The former (see D5.1 for a thorough description) is substantially language-independent (it only requires a tokenizer), and is applied by converting the text of the sentences T and H into sequences of "words" (i.e. tokens, lemmas, or parts of speech). Accordingly, the edit operations are defined as: i) Insertion - insert a word from H into T; ii) Deletion - delete a word from T; and iii) Substitution - substitute a word from T with a word from H. The second algorithm (Tree Edit Distance) is a language-dependent edit distance algorithm (a parser is needed to process the input T-H pairs) that, instead of operating on words, considers dependency trees. Edit operations are defined at the level of single nodes of the dependency tree: i) Insertion - insert a node A from


the dependency tree of H into the dependency tree of T; ii) Deletion - delete a node A from the dependency tree of T; and iii) Substitution - change the label of a node A in the source tree into a label of a node B of the target tree.

As far as the entailment rules are concerned, these contain: i) a left context (an expression that has to be matched by a fragment of T in order for the rule to be applied); ii) a right context (an expression that has to be matched by a fragment of H); iii) a set of constraints (a set of Lisp predicates that verify certain constraints in rule application); and iv) a score (a value associated with the rule, expressing the probability of an entailment relation between the left and the right context). Entailment rules represent optional knowledge (in a completely language-independent configuration of the system the entailment rules files will be empty) useful for performing the edit operations. Depending on how knowledge is acquired, the creation of the entailment rules allows for various degrees of language dependency. Such knowledge can be mined from external resources (e.g. WordNet), if they are available for a given language, or from large corpora (e.g. by means of statistical approaches). For instance, two rules based on WordNet may assign different entailment probabilities to pairs of synonyms (higher score) and antonyms (lower score).

As far as the cost schemes are concerned, these represent a definition of the actual costs of the basic edit operations. Also in this case, the component allows for both language-independent and language-specific configurations, depending on: i) the type of input of the edit operations (i.e. operating with tokens, lemmas, parts of speech, or nodes in a dependency tree), and ii) the use of entailment probabilities, as defined in the entailment rules.

As can be seen, the flexibility of EDITS allows for a variety of configurations, ranging from completely language-independent modalities6 (Linear Distance algorithm, no entailment rules, cost schemes operating with tokens) to more complex language-specific modalities (Tree Edit Distance algorithm, entailment rules built from language-specific resources, cost schemes operating with semantically labeled nodes in a dependency tree).
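To make the rule structure concrete, here is a schematic sketch of two lexical rules in the spirit described above; this is purely illustrative, and the concrete rule file syntax used by EDITS is documented in the user manual.

    # Hypothetical representation of lexical entailment rules (schematic
    # only; EDITS' actual rule format may differ).
    synonym_rule = {
        "left_context": "movie",   # fragment to be matched in T
        "right_context": "film",   # fragment to be matched in H
        "constraints": [],         # e.g. predicates checking POS agreement
        "score": 0.9,              # high probability: WordNet synonyms
    }
    antonym_rule = {
        "left_context": "cheap",
        "right_context": "expensive",
        "constraints": [],
        "score": 0.1,              # low probability: WordNet antonyms
    }
    # During substitution, a matching rule's score can lower (or raise)
    # the substitution cost relative to the default cost scheme.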

Usage documentation

A user manual for the latest downloadable version of the EDITS package is available on the project website (http://qallme.fbk.eu/).

4.2. Evaluation of the EDITS package

The development of the general entailment-based architecture, as well as the extensions and improvements to the EDITS package, have been guided by the outcomes of periodic evaluations. During the second year, these evaluations have been carried out over different data, in different scenarios, and considering different aspects of the problem. The following sections report the results of three evaluation experiments, which aim to demonstrate the effectiveness of the proposed approach.

6 Since (at least) a tokenizer is available for almost all the languages, we consider this level of processing as language-independent.


All the experiments share the same approach described in D5.1 (see Section 3 of D5.1 for further details), which can be summarized as an entailment-based approach to relation extraction for QA over structured data. The problem is set as a relation extraction task, where all the relations of interest in a given domain have to be extracted from natural language questions. Textual Entailment is applied to compare the input questions with a repository of Minimal Relational Patterns (MRPs). The underlying assumption is that a question expresses a certain relation if a textual pattern for that relation is entailed by the question.

The motivation is that, in the framework of QA over structured data, extracting from an input question all the relevant relations in a given domain becomes crucial, as it allows the system to fully capture the context in which the request has to be interpreted and, in turn, to determine the constraints on the database query. For instance, given the question "What movie can I see today at cinema Astra?", an effective database query will select a concept of type MOVIE with specific relations with a DATE (i.e. HasDate(MOVIE:?, DATE:"today")) and a CINEMA (i.e. HasMovieSite(MOVIE:?, SITE:"Astra")). Successful answer retrieval depends on capturing all and only the relations expressed in the question: unrecognized relations will result in underspecified queries (often leading to redundant answers), while spuriously recognized relations will result in overspecified queries (leading to answer extraction failures).
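A schematic sketch of this MRP-based relation extraction loop (the names are illustrative; the per-relation thresholds are estimated in training, as described in the experiments below):

    # Schematic sketch of entailment-based relation extraction over a
    # repository of Minimal Relational Patterns (MRPs); names illustrative.

    def extract_relations(question_pattern, mrp_repository, score, thresholds):
        """Return the set of relations whose MRPs are entailed by the question.

        mrp_repository: dict mapping a relation name to its list of MRPs,
                        e.g. {"HasMovieSite": ["where can i see [MOVIE]"], ...}
        score:          entailment scoring function (lower = more entailed)
        thresholds:     per-relation distance thresholds learned in training
        """
        relations = set()
        for relation, patterns in mrp_repository.items():
            for mrp in patterns:
                if score(question_pattern, mrp) < thresholds[relation]:
                    relations.add(relation)  # pattern entailed by the question
                    break
        return relations

    # e.g. "what movie can i see today at cinema [CINEMA]?" should yield
    # {"HasDate", "HasMovieSite"} and nothing else.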

While the general approach is the same, each experiment used different data to target different aspects of the problem. More specifically:

1. The first experiment used all the English questions in the QALL-ME benchmark. The objective was to compare the two algorithms available in the EDITS package, evaluating the improvements achieved by the Tree Edit Distance algorithm over the results achieved using Linear Distance;

2. The second experiment used the automatic transcriptions of the Italian questions in the QALL-ME benchmark. The objective was to compare the performance achieved by EDITS over optimal inputs (manual transcriptions) with that achieved over sub-optimal inputs (automatic transcriptions);

3. The third experiment used Italian questions posed by telephone by real users in a focus group for the Cinema/Movie domain. The objective was to set up a realistic evaluation scenario, in order to assess the performance achieved by EDITS over real data.

4.2.1. Experiment 1: Linear Distance vs. Tree Edit Distance

In the first experiment (Negri et al., 2008) we compared the two RTE algorithms available in the latest version of the EDITS package, namely Linear Distance and Tree Edit Distance. For a meaningful comparison, due to the unavailability of a reliable dependency parser for Italian, the question dataset used consists of 1487 English questions extracted from the QALL-ME benchmark. The questions are manual transcriptions of 1223 read and 264 spontaneous telephone requests, which have been marked as containing one or more relations chosen from a set of 75 binary relations defined in the QALL-ME ontology. The annotated questions were used to create the training and test sets for the experiment. For this purpose, the question corpus was randomly split into two sets, containing 999 and 488 questions respectively. The split was carried out guaranteeing that, for each relation R, the questions marked with R are distributed between the two sets in a 2/3-1/3 proportion. The larger set of 999 questions was used for the acquisition of MRPs and, together with the resulting Pattern


Repository, was used to train the EDITS package (i.e. to empirically estimate an entailment threshold for each relation, considering positive and negative examples). The smaller set of 488 questions (which remained "unseen" in the MRP acquisition phase) was used as the test set.

For each relation R a set of MRPs was manually extracted from the training questions annotated with R. The resulting pattern repository was populated with a total of 449 patterns, with at least 1 MRP per relation (6 on average). Within this experimental setting, the following configurations of the EDITS package have been compared:

• Linear Distance + Default Cost Scheme7 (LD+DEF) - to compare this word-level algorithm with those based on syntactic structure matching;
• Tree Edit Distance + Default Cost Scheme (TED+DEF) - to evaluate the contribution of considering dependency tree representations of the T-H pairs, obtained using Minipar8;
• Tree Edit Distance + Depth Scheme9 (TED+DS) - to evaluate the impact of a cost scheme specifically designed to manipulate nodes in a dependency tree;
• Tree Edit Distance + Width Scheme10 (TED+WS) - to evaluate the impact of another cost scheme manipulating nodes in a dependency tree.

The distances resulting from the previous configurations were also used as features to train a classifier with the Random Forest learning algorithm (implemented in Weka11), obtaining the last experimented configuration:

• ALL - to evaluate the potential of a combination of all the previously described distances.

7 The Default Cost Scheme (DEF) is a basic cost scheme included in the already available configurations of the EDITS package. In this scheme the cost of inserting a text fragment from H into T is equal to the length (i.e. the number of words) of T, and the cost of deleting a text fragment from T is equal to the length of H. The substitution cost is set to the sum of the insertion and deletion costs of the text fragments if they are not equal; this means that the algorithm prefers to delete and insert text fragments rather than substitute them when they differ. Setting the insertion and deletion costs respectively to the length of T and H is motivated by the fact that a shorter text T should not be preferred over a longer one T' when computing their overall mapping costs with the hypothesis H. Setting the costs to fixed values would in fact penalize longer texts (due to the larger number of deletions needed) even when they are very similar to H.
8 http://www.cs.ualberta.ca/~lindek/minipar.htm
9 In this scheme the costs of inserting a node from the dependency tree of H into T, and of deleting a node from the dependency tree of T, are inversely proportional to the depth (distance from the root) of the nodes in the dependency trees of T and H. The rationale behind this cost scheme is that words that are important to the meaning of the sentence, like verbs, subjects and objects, are usually at the top of the dependency tree and thus should have higher insertion and deletion costs.
10 In this scheme the cost of inserting or deleting a node is proportional to the number of children of the node in the dependency trees of T and H. The rationale is that the words that connect phrases of the sentence are meaning-preserving and should have higher insertion and deletion costs.
11 http://www.cs.waikato.ac.nz/ml/weka/


All these configurations have also been compared with a simple baseline, the Longest Common Subsequence (LCS), a similarity measure often used by RTE systems12.

Table 1: Performance achieved by different configurations of EDITS.

            LCS     LD+DEF   TED+DEF   TED+DS   TED+WS   ALL
Precision   0.557   0.724    0.687     0.693    0.802    0.832
Recall      0.233   0.521    0.470     0.468    0.501    0.592
F1          0.318   0.606    0.559     0.559    0.617    0.692

Table 1 reports the results achieved by each configuration. Precision/Recall/F1 scores indicate the system's ability to recognize the relations expressed in the test set questions. Considering these results, we can draw the following conclusions:

1. All the experimented configurations of EDITS significantly outperform the baseline (LCS), showing that distance-based algorithms are better suited to capturing entailment relations than simple word-matching techniques.

2. As far as the single distance-based algorithms are concerned, TED taken in isolation slightly improves over LD in only one case (TED+WS). In general, we observe that TED alone achieves lower recall than LD. This can be explained by the parser's difficulties in processing questions and in handling some syntactic structures like conjunctions, appositions, and relative clauses. TED+WS performs better than the other TED configurations, as it handles compound nouns and PP phrases more effectively (i.e. it assigns lower deletion costs to words that connect the main verb with its complements). Consider, for instance, the following T-H pair:

Q7: "Where can I see the movie [Movie:Shrek]?"
p: where can i see [MOVIE]

In this case, to complete the mapping between T and H, the edit distance algorithm has to delete from T the word "movie", which is part of a compound noun phrase. Using TED+WS, the contribution of this edit operation to the overall entailment score of the T-H pair will be lower than in the other TED-based configurations.

3. The best result, achieved by the combined configuration (ALL), demonstrates that the different combinations of distance algorithms and cost schemes cover different entailment phenomena, and together they improve over the baseline by up to +117% (from 0.318 to 0.692 F1).

4. The combined configuration (ALL) achieves good results especially in terms of Precision. This is particularly important in view of the overall QA application, for which high precision is a requirement. The possibility of answering a question Q, in fact, depends on the system's ability to avoid overspecified (i.e. too restrictive) queries due to false positives in the RE phase (i.e. recognized relations that are not present in Q).

12 Given a text T=<t_1,...,t_n> and a hypothesis H=<h_1,...,h_m>, the LCS is defined as the longest possible sequence W=<w_1,...,w_k> such that the words in W appear in both T and H in the same order.


4.2.2. Experiment 2: dealing with Italian spoken requests

The objective of the second experiment (Gretter et al., 2008) was to estimate the impact on performance of using automatically transcribed (i.e. noisy) requests for the task of extracting relations from natural language questions. For this purpose, different experiments have been carried out, running the system under different training/test conditions. In the optimal situation, the system was trained and tested over datasets of manually transcribed Italian questions extracted from the QALL-ME benchmark. The results achieved in this first setting have been compared with: i) those achieved by a system trained over manually transcribed questions and tested over automatic transcriptions (to reproduce a situation that is closer to a real on-field evaluation); ii) those achieved by a system trained and tested over automatic transcriptions (to verify whether the system can "learn" from systematic errors produced by the ASR); and iii) those achieved by training the system over both manual and automatic transcriptions and testing it over automatic transcriptions (to check whether the two sources, used together, give added value at the training stage). In this evaluation, the experimental setting consists of:

• 1487 Italian questions, manually annotated as containing one or more relations chosen from a set of 59 binary relations defined in the QALL-ME ontology.

• A Pattern Repository populated with a total of 226 patterns, with at least 1 MRP per relation (4 on average).

The evaluation was carried out under the following combinations of clean and noisy training/test data:

• Clean/Clean (C/C). The system was trained and tested over manually transcribed questions. This is the same evaluation setting described in Experiment 1, and is used for comparison with the other settings involving noisy data.

• Clean/Noisy (C/N). The system was trained over manually transcribed questions and tested over automatic transcriptions. The idea is to check performance variations with non-homogeneous training/test data.

• Noisy/Noisy (N/N). Both training and test were carried out over automatically transcribed questions. The idea is to verify whether the errors produced by the ASR system can be "learned" by the system.

• Clean+Noisy/Noisy (C+N/N). The system was trained over a combination of manual and automatic transcriptions, and tested over automatic transcriptions. The idea is to check whether the two sources, used together, give added value at the training stage.

Table 2: Results obtained with different combinations of clean (C) and noisy (N) data.

            C/C    C/N    N/N    C+N/N
Precision   0.75   0.74   0.74   0.75
Recall      0.55   0.51   0.52   0.56
F-measure   0.63   0.60   0.61   0.64

Quite surprisingly, the results reported in Table 2 show that the impact of dealing with noisy data is almost negligible under all the training/test combinations. This can be explained by the fact that most of the ASR errors typically concern functional words


(articles and prepositions), which are not really important in determining the meaning of an input question (in terms of the relations it contains). A crucial point could also be the way in which relations are learned, which proved to be quite robust; this also explains why only little difference is observed when clean, noisy, or clean + noisy data are used. Such positive results demonstrate that, at least in the proposed task, the benefits of enabling speech access capabilities that provide a more natural human-machine interaction outweigh the minimal loss in terms of performance.

4.2.3. Experiment 3: on-field evaluation

The third experiment was carried out with the twofold objective of: i) evaluating EDITS, in the Relation Extraction task, within a real application scenario (involving real users asking questions by telephone), and ii) learning from the results about possible weaknesses of the system, in order to improve the core components of the package. A focus group was created for these purposes, and a working infrastructure tuned for the Cinema/Movie domain was set up.

Evaluation metrics. In contrast with the previous evaluations (which only adopted standard Precision/Recall/F-measure scores), in the third experiment an exact match score has also been calculated. The exact match score represents the proportion of test questions for which all and only the relations annotated in the gold standard are recognized by the system. The exact match score gives a clearer idea of the number of questions that would actually be answered by the system, independently of the availability of data. This represents a better assessment of the system's performance, since such information cannot be precisely induced from Precision/Recall/F-measure scores.

First round results. A first evaluation has been carried out over a total of 284 Italian questions acquired by the focus group over a one-month period. All the questions were asked over the telephone by real users, and relate to the selected Cinema/Movie domain. All the questions have been annotated as containing one or more relations, chosen from a set of 13 binary relations, relevant for the selected domain, defined in the QALL-ME ontology. The EDITS package, at the core of the QA infrastructure, has been configured trying to reproduce the performance of the first version of the Italian QALL-ME demonstrator13. This configuration, which represents the reference starting point for comparison with further improvements, featured:

• Linear Distance algorithm
• Default cost scheme, operating at the level of tokens
• Empty entailment rules set

As far as the Pattern Repository is concerned, we used the set of 77 patterns for the Cinema/Movie domain originally created for the first demonstrator. The results achieved in this experimental setting are:

Precision: 0.76
Recall: 0.43
F1: 0.55
Exact Matches: 68/284 (24%)

13 This configuration roughly corresponds to the LD+DEF scheme used for the experiments described in Section 4.2.1.


Lessons learned, improvements, and second round results. A qualitative analysis of the system’s output suggested several improvement directions. These include:

• The use of word lemmas (instead of tokens) as input to the edit operations, in order to generalize the existing MRPs.

• The refinement of the default cost scheme, considering word-specific costs in the basic formula.

• The extension and improvement of the Pattern Repository: in the first version of the repository, very frequent patterns for some relations were missing, and some of the available patterns were a source of errors due to their ambiguity.

Taking advantage of the flexibility of the EDITS package, further experiments were then carried out, over the same data, with the following configurations.

1. LD+DEF_lemmas-77 – with: i) edit operations considering word lemmas instead of tokens, and ii) the original Pattern Repository with 77 MRPs.

2. LD+DEF_lemmas+NewCostsENT-77 – with: i) edit operations considering lemmas, ii) 77 MRPs, iii) a higher cost for the insertion in T (the input question) of named entities appearing in H, and iv) deletion costs set to 0. The motivation for increasing the insertion costs for named entities is that different words should have different weights when edit operation costs are calculated. For instance, proper nouns and verbs are more important than determiners and prepositions in the computation of the overall cost of mapping T into H. Our hypothesis is that named entities are often relation-specific (e.g. a [DIRECTOR] will always appear in patterns for the relation HasDirector(Movie,Director)), and their contribution to the overall meaning of a pattern is higher than the contribution of any other word. To check the validity of this hypothesis, in this configuration we assigned a weight of 100 to the named entities appearing in H, thus transforming the default insertion cost for named entities from |T| (the length of T) into |T|*100. The motivation for setting the deletion costs to 0 is that our questions, as they often involve more than one relation, are usually much longer than the available MRPs. In this situation, a computation of the entailment score using deletion costs set to |H| (i.e. the length of H) would still penalize longer questions, even if they actually contain all the terms of an MRP, due to the larger number of deletions required for the mapping. (A sketch of these modified cost schemes is given after this list.)

3. LD+DEF_lemmas+NewCostsALL-77 – with: i) edit operations considering lemmas, ii) 77 MRPs, iii) a specific insertion cost C(w), empirically estimated over training data for each term in the Pattern Repository, and iv) deletion costs set to 0. This configuration extends and improves the previous one, through the definition of a specific insertion cost for each word appearing in the Pattern Repository. For each word w, such cost was calculated as:

C(w) = #Rels / #Rels_w

where #Rels is the total number of relations of interest, and #Rels_w is the number of relations for which at least one pattern in the repository contains w. Since the total number of relations for the Cinema/Movie domain is 13, C(w) ranges from 1 (i.e. w appears in at least one pattern for every relation) to 13 (i.e. w appears in the patterns for only one relation). The higher C(w), the higher the relevance (and the insertion cost) of a word from H in T.

FP6 IST-033860

Page 20 of 24

4. LD+DEF_lemmas+NewCostsALL-149 – the previous configuration, accessing a new Pattern Repository with 72 additional MRPs (149 in total).
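For illustration only, the following Python sketch mirrors the modified cost schemes of configurations 2-4. It is not the actual EDITS code: the function names, the repository representation, and the default cost for words unseen in the repository are our own assumptions.

# Hypothetical sketch of the modified cost schemes (not the actual EDITS code).

N_RELATIONS = 13  # relations of interest in the Cinema/Movie domain

def c_w(word, repository):
    """Repository-based insertion cost: C(w) = #Rels / #Rels_w.
    repository: dict mapping each relation to the set of words occurring
    in at least one of its patterns."""
    rels_with_w = sum(1 for words in repository.values() if word in words)
    if rels_with_w == 0:
        return 1.0  # assumption: default cost for words unseen in the repository
    return N_RELATIONS / rels_with_w  # from 1 (in all relations) to 13 (in one)

def insertion_cost_ent(word, is_named_entity, t_length):
    """Configuration 2: default insertion cost |T|, multiplied by 100
    for named entities appearing in H."""
    return t_length * 100 if is_named_entity else t_length

def deletion_cost(word):
    """Configurations 2-4: deletion costs set to 0, so that long questions
    containing all the terms of a short MRP are not penalized."""
    return 0.0

With these schemes, mapping a question T into a pattern H is dominated by the insertion of pattern terms missing from the question, weighted by how relation-specific those terms are, while deletions of extra question material are free.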

The results achieved with these configurations are reported in Table 3.

                 LD+DEF_        LD+DEF_lemmas    LD+DEF_lemmas    LD+DEF_lemmas
                 lemmas-77      +NewCostsENT-77  +NewCostsALL-77  +NewCostsALL-149
Precision        0.78           0.87             0.93             0.95
Recall           0.57           0.82             0.92             0.95
F1               0.66           0.85             0.93             0.95
Exact Matches    71/284 (25%)   159/284 (55%)    215/284 (75%)    234/284 (82%)

Table 3: Second round results

As can be seen, all the new configurations improve over the reference scores (i.e. P: 0.76, R: 0.43, F1: 0.55, Exact Matches: 68/284), with considerable increases in exact matches both with the original Pattern Repository and with its extended version. These results confirm our working hypotheses, and represent a significant step forward in the realization of a working infrastructure for entailment-based QA over structured data.

4.3. Automatic acquisition of costs for edit operations

In order to estimate the optimal cost of each edit operation in a cost scheme, as well as to weight costs such as those for substituting the terms appearing in the entailment rules, a stochastic method based on Particle Swarm Optimization (PSO; Eberhart et al., 2001) was implemented as a new module in EDITS. By configuring the PSO module14, we try to learn the optimal cost for each edit operation in order to improve the prior textual entailment model. The main goal of using this module is to automatically estimate the best possible operation costs on the development set. Besides, such a method makes it possible to inspect the learned cost values and thus better understand how TED approaches the data in textual entailment. Moreover, the automatic estimation of costs using PSO allows the user to find the optimal weight of different resources and to measure their contribution on a dataset. The positive results achieved with the integration of the new component are reported in Mehdad (2009). The improved system has been used for FBK's participation in the RTE5 Evaluation Campaign (Mehdad et al., 2009), and in the EVALITA RTE Task (Mehdad et al., 2009).
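For illustration, the following standalone Python sketch shows how a PSO loop can search for the edit operation costs that maximize accuracy on a development set. The actual EDITS module relies on the JSwarm-PSO library14; this simplified version, its parameter choices, and the assumed three-dimensional cost vector are our own.

import random

# Hypothetical standalone PSO sketch for learning edit operation costs.

def pso_optimize(fitness, dim=3, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, lo=0.0, hi=10.0):
    """Maximize fitness(costs) over cost vectors, e.g.
    costs = [insertion, deletion, substitution]."""
    # Initialize particle positions (candidate cost vectors) and velocities.
    pos = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_fit = [fitness(p) for p in pos]
    gbest_fit = max(pbest_fit)                  # best position seen by the swarm
    gbest = pbest[pbest_fit.index(gbest_fit)][:]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Standard PSO velocity update: inertia + cognitive + social term.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest

# In this setting, fitness(costs) would train and evaluate the entailment model
# on the development set with the candidate costs and return its accuracy.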

4.4. Open-source release of the EDITS package

The delivery of EDITS as open-source software was one of the planned activities for the third development cycle of the project. According to these plans, the first release of

14 JSwarm-PSO (http://jswarm-pso.sourceforge.net/) has been used for this purpose.


the EDITS (1.0) package is now available under the GNU Lesser General Public License (LGPL), and can be downloaded from http://edits.fbk.eu. As of October 14, 2009, the number of downloads was 89. In addition, to allow potentially interested users to experiment with EDITS, a Web interface to the system (http://qallmedemo.fbk.eu/editsui/) has been implemented. The interface can be run over both Italian and English RTE pairs, with different configurations resulting from the combination of:

• the two distance algorithms (i.e. token edit distance, and tree edit distance) implemented by the current version of EDITS;

• four predefined cost schemes (i.e. dynamic costs, fixed costs, dynamic costs used for FBK’s participation in RTE5, and dynamic costs used for EVALITA 2009);

• three predefined sets of entailment rules (i.e. English rules extracted from Verbocean15 and from WordNet16, and Italian rules extracted from Wikipedia17); a minimal sketch of the WordNet-based extraction is given below.

The web interface allows for training EDITS over different RTE datasets (namely the PASCAL RTE 1, 2, and 3 development sets for English, and the EVALITA 2009 development set for Italian). Once the system is trained over one of these RTE corpora, testing can be done either over one of the available test sets (namely the PASCAL RTE 1, 2, and 3 test sets for English, and the EVALITA 2009 test set for Italian), or over a text/hypothesis pair input by the user.
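As an illustration of the hyponymy- and synonymy-based rule extraction described in footnote 16, a minimal Python sketch using NLTK's WordNet interface (our choice of library; the original extraction worked directly on WordNet 3.0) could look as follows:

# Hypothetical sketch: extracting entailment rules from WordNet via hyponymy
# and synonymy (cf. footnote 16: car ENTAILS vehicle). Requires the NLTK
# WordNet data to be installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def entailment_rules(term):
    rules = set()
    for synset in wn.synsets(term, pos=wn.NOUN):
        # Synonymy: synset lemmas are treated as mutually entailing;
        # here we emit the rules starting from the input term.
        for lemma in synset.lemma_names():
            if lemma != term:
                rules.add((term, lemma))
        # Hyponymy: a term entails each of its hypernyms (car ENTAILS vehicle).
        for hypernym in synset.hypernyms():
            for lemma in hypernym.lemma_names():
                rules.add((term, lemma))
    return rules

print(sorted(entailment_rules("car")))  # includes ('car', 'motor_vehicle'), ...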

References

• Beckett, D. and Berners-Lee, T. (2008): Turtle – Terse RDF Triple Language. W3C Team Submission 14 January 2008. <http://www.w3.org/TeamSubmission/2008/SUBM-turtle-20080114/> Retrieved on November 3, 2009.

• Negri, M. (2007): Semantic Based Answer Extraction Modules. I cycle. QALL-ME project document QALL-ME_D5.1_P_20071016.

• Ou, S., Orasan, C. and Pekar, V. (2007): D4.1 Annotated documents and database for storing them, IE modules, semantic IR engine (first cycle). QALL-ME project document QALL-ME_D4.1_R_20071010.

15 Verbocean (http://demo.patrickpantel.com/Content/verbocean/) has been used to extract 18232 entailment rules for the English verbs appearing in the PASCAL RTE4 dataset which are connected by the "stronger-than" relation. For instance, if kill [stronger-than] injure, then kill ENTAILS injure.

16 WordNet 3.0 (http://wordnet.princeton.edu/) has been used to extract 2698 English entailment rules for terms in the PASCAL RTE4 dataset connected by hyponymy and synonymy relations. For instance, if vehicle [has-hyponym] car, then car ENTAILS vehicle.

17 Entailment rules (for 91692 Italian nouns) have been extracted from Wikipedia using the jLSI (Java Latent Semantic Indexing) tool (http://tcc.itc.it/research/textec/tools-resources/jlsi.html) to measure relatedness between the term pairs in the EVALITA 2009 training data.


• Ou, S. (2008): D4.2 Annotated documents and database for storing them, IE modules, semantic IR engine (second cycle). QALL-ME project document QALL-ME_D4.2_P_20081115.

• Ponomareva, N., Gomez, J. M. and Pekar, V. (2009): AIR: a semi-automatic system for archiving institutional repositories. In: Proceedings of the 14th International Conference on Applications of Natural Language to Information Systems (NLDB 2009). Saarbrücken, Germany, June 24–26 2009.

• Prud’hommeaux, E. and Seaborne, A. (2008): SPARQL Query Language for RDF. W3C Recommendation 15 January 2008. <http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/> Retrieved on November 3, 2009.

• Sacaleanu, B. and Spurk, C. (2007): First QALL-ME Prototype: Architecture and Functionality. QALL-ME project document QALL-ME_D7.1_20071015.

• Sure, Y., Bloehdorn, S., Haase, P., Hartmann, J. and Oberle, D. (2005): The SWRC Ontology – Semantic Web for Research Communities. In: Bento, C., Cardoso, A. and Dias, G. (2005): Proceedings of the 12th Portuguese Conference on Artificial Intelligence – Progress in Artificial Intelligence (EPIA 2005); volume 3803 of LNCS, pp. 218–231. Springer (Covilhã, Portugal).

• Spurk, C. and Neumann, G. (2008): Intermediate QALL-ME Prototype: Architecture and Functionality. QALL-ME project document QALL-ME_D7.2_20081115.