8/2/2019 Matching Natural Language Multi Domain Queries to Search Service
http://slidepdf.com/reader/full/matching-natural-language-multi-domain-queries-to-search-service 1/96
POLITECNICO DI MILANO
FACULTY OF ENGINEERING
MASTER OF SCIENCE PROGRAMME IN COMPUTER ENGINEERING

MATCHING NATURAL LANGUAGE MULTI-DOMAIN QUERIES TO SEARCH SERVICES

Advisor: Ing. Marco BRAMBILLA
Co-advisor: Prof. Stefano CERI

Master's thesis by:
Claudia Farè
Student ID 721154

ACADEMIC YEAR 2008-2009
The computer was born to solve problems that did not exist before.
Bill Gates
Contents
1 Introduction
1.1 Context
1.2 The Problem
1.3 Objective

2 Background Work
2.1 SeCo, beyond Page Search
2.2 The General Architecture
2.2.1 The registration flow
2.2.2 The query execution flow
2.2.2.1 Query analysis
2.2.2.2 Query to domain and service mapping
2.2.2.3 Query Planner
2.2.2.4 Query engine
2.2.2.5 Result transformation and Interfaces
2.3 Service Marts
2.4 The Natural Language Framework
2.5 WordNet
2.6 WordNet Domains
2.7 Stanford Parser
2.8 Named Entity Recognition
2.9 Technologies
3 Related Work
3.1 WordNet
3.2 WordNet Domains
3.3 Named Entity Recognizer
3.4 Query splitting
3.5 Matching
4 The Thesis Project Contribution
4.1 Objective
4.2 Hypothesis
4.3 Query Analysis
4.3.1 The parsing
4.3.2 The splitting
4.4 The Extraction of the Data Types
4.5 Mapping to domains
4.5.1 Methods to improve the domain score
4.6 The Service Mart Repository
4.7 Map sub-queries to Service Marts
4.8 Map sub-queries to Access Patterns
4.8.1 The semantic name matching
4.8.2 Evaluation Criteria and Statistics
5 Implementation
5.1 The system general architecture
5.2 The Sift Application
5.2.1 Bee - Distributed Background Processing
5.3 Procedures
5.3.1 Parsing
5.3.2 Sentence Splitting Strategies
5.3.3 Information extraction
5.3.4 Service Mart Semi-Automatic Generation
5.3.5 Map sub-queries to access patterns
6 Evaluation
6.1 Creation of the corpus of queries and service marts
6.2 The Experiment
6.3 The results and the Evaluation
6.3.1 Entries evaluation
6.3.2 Splitting Evaluation
6.3.3 Domain Extraction Evaluation
6.4 Service Mart Matching Evaluation
6.5 A complete example of info extraction, splitting and matching

7 Conclusions
7.1 Objectives and Final Evaluation
7.2 Future Works

8 Appendix
List of Figures
2.1 The overall architecture of the system, together with the two main execution flows
2.2 The research flows for the Natural Language Framework
2.3 The sample pie chart
4.1 The trees retrieved from the analysis of the query
4.2 The tree for the correct example
4.3 The semantic modelization of the Service Mart
4.4 The WordNet Domains Hierarchy
4.5 Example of sub-query/AP matching
4.6 Example statistics for first level split
4.7 Example statistics for clause split
4.8 Example statistics for domains extraction
5.1 The Architecture schema
5.2 The models
5.3 Screen of the Sift application
5.4 The Bee Structure
5.5 The algorithm schema after the splitting
5.6 The Information Extraction Flow
5.7 Domain extraction algorithm structure
5.8 A graphical sample of a substructure of the WordNet hierarchy
5.9 A sample of a generated service mart data structure
5.10 Mapping schema
6.1 The Main Screen of the Sift Application
6.2 First Level Split Statistics
6.3 Clause Level Split Statistics
6.4 Wrong First Level Splitting
6.5 WordNet Domain Statistics
6.6 WordNet Domain Statistics Optimized
6.7 Service Mart Matching Statistics
6.8 The clause split of the sample entry
6.9 The trees of the clause split division
6.10 Matching for the sub-entries
6.11 Service Mart matching for the sub-entries
List of Tables
5.1 The task interface
5.2 Bonus Refinement example
5.3 List of Groups for the Service Mart Generation
6.1 Domain Extraction Results summary
6.2 The Data Types extracted
Chapter 1
Introduction
1.1 Context
In recent years, much research effort in information retrieval has been devoted to full-text search and document indexing. The main fruits of these efforts are the general-purpose search engines that everyone uses, such as Yahoo!™ and Google™; the latter has even become a verb in the English language, given the popularity of the term. These engines let us retrieve any document available on the web about the topic we are searching for. If the World Wide Web began the democratization of information availability, these search engines carried it to its peak. However, this simple but broad kind of search brings some limitations. Users no longer want to look for generic documents about a topic; they want answers to specific questions, as if the search engine were a human being that understood their needs and satisfied them. To find an answer with a general-purpose engine such as Google™, users usually have to hope that someone has already asked the same question in an indexed document, or read through many documents hoping to find what they were looking for. Much research has explored this field, and one notable effort is represented by knowledge-based search systems. These systems let the user ask a specific question against a knowledge base built on large ontologies that can select the right answers. This works very well for "non-changing" information such as technical, mathematical, geographical, and physics questions, but it is unreliable for ever-changing data like news and events. Moreover, each request is restricted to a single domain: only a specific question about one topic at a time can be asked. The goal of future research is therefore to lift the limitation of single-domain questions and to provide results not only about precise facts, but also for questions whose answers span several domains, with possible rankings based on features. For example, the question "I want a cheap Chinese restaurant near piazza Duomo in Milan" involves two domains, "place" and "Chinese restaurants", and requires a ranking based on price. In recent years, web services have also grown in popularity. These services offer a software interface that allows other systems to interact with them through the HTTP protocol. The proliferation of open and accessible web search services has allowed the world to access, aggregate, and mix data in previously unthought-of ways. From these premises the SeCo project at Politecnico di Milano was born. The project is currently under active development and aims at building a system that pushes the boundaries of current search engines.
1.2 The Problem
Although many advances have been made in the theoretical and formal aspects of distributing multi-domain queries and merging back the results, much work remains on interfacing the system with the user in the most natural way. Interfaces for such services are usually complex and have to be configured manually, sometimes with a far from user-friendly syntax. Services such as Yahoo!™ or Google™, by contrast, have popularized the simple text box where free text can be entered; the filtering and understanding is left entirely to the service, while the user writes as he would to another understanding entity. This is the main problem that led us to the current research project in the field of query analysis, specifically oriented towards understanding, translating, and matching with the right services the queries made in the SeCo project, which can span more than one domain. Answering multi-domain questions in a non-automated way is a complex and tedious job for users, because they need to coordinate the answers from various services. If we can extract multi-domain information from a single query and match the elements of the query directly with the right services, this will be a great step towards a fully functioning multi-domain search engine.
1.3 Objective
The main objective of this thesis project is to set up an analysis and matching environment for natural language multi-domain queries, and to examine the results retrieved with the tools under experimentation. This environment is based on the Sift application and, through a number of splitting and information extraction tools, allows us to examine the entries and match them to suitable web search services, or service marts. In the information extraction part we translate the input questions from the natural form in which a user would write them into a form the system can understand and act upon. In the matching part we try to match the given queries to suitable service marts, which then become the starting point of the information retrieval process.
Chapter 2
Background Work
2.1 SeCo, beyond Page Search
In the last few years, Internet search has mainly consisted of routing users to the web page that best answers the question they submit. The page search services available online typically fall into three main kinds.

General-purpose search engines, the most popular and widely used, such as Google™ and Bing™, base their searches on relevance and ranking indexes that are updated according to the importance and popularity of each web page. These search engines owe their popularity to their ability to fulfill user needs; however, not all information requests can be satisfied by web pages (the so-called "surface web"). Most of the information available on the Internet lies in the "deep web", an expression that refers to all dynamically generated sites whose content cannot be reached by search engine crawlers.

A second kind of search technology is the knowledge-based search system. These systems base their searches on large, previously built ontologies that select the right answer to a question. With this approach, the wider the ontology, the more effective the results. This method is superior to conventional search for answering queries over well-structured or organized knowledge. The downside is that such wide ontologies require long development times and great effort to keep the knowledge base up to date, with no possibility of adding dynamic or ever-changing data like weekly events or news.
The third approach is the meta-search engine. These engines combine the results for a single-domain request in a way that would take hours for an ordinary user to achieve with generalist engines alone. For instance, a meta-search engine can provide a price-ordered list of flights between two cities in a few seconds, a task that would otherwise require a lengthy visit to airline and travel agency sites. The main downside of these search engines is the single-domain limit.

None of these approaches can reach a multi-domain answer in a single search. Though very effective for a large number of queries, they do not support multi-domain requests: if a multi-domain question is submitted, the result will very likely be unsatisfactory, unless the same combination of multi-domain data happens to exist on a single web page.
The SeCo project aims at creating a multi-domain search system based on web services. The platform pushes the limits of the field of multi-domain queries by formalizing its theoretical aspects as well as addressing it from a software engineering point of view, enabling the construction of a usable search engine that answers arbitrary queries. These queries are analyzed and matched to suitable web services; the results are finally aggregated, and the user can visualize multi-domain results in response to a single request.
2.2 The General Architecture
The SeCo project is divided into higher-level components composed in a service-oriented manner. Within the multi-domain query answering problem, the SeCo architecture can be divided into two main activity flows: the registration flow, which deals with the creation of new domains, domain descriptions, and search services within the framework; and the query execution flow, which deals with the actual enactment of the queries. The main components are the query analysis, the query-to-domain mapper, the query planner, the query engine, and the results transformation. Two frameworks, the service and domain frameworks, are also added as intelligent repositories.
In the query execution flow, a query sent by the user first passes through the query analysis and the query-to-domain mapper, where the different domains and properties are extracted from the natural language query. It then goes to the query planner, which creates an execution plan taking into account the different costs of executing the query, in order to obtain the most efficient execution. The different sub-queries are then sent to the domain and service frameworks, which take care of calling the external services through a Web or messaging interface. The results are then collected and, according to the plan, merged back together. The final results are transformed before being sent back to the user. While the query execution flow covers all the processing from an end user's request to the system's response, the activity in the registration flow mainly concerns the registration of search services by service designers or other developers.
Figure 2.1: The overall architecture of the system, together with the two main
execution flows.
2.2.1 The registration flow
The registration flow comprises all the activities that deal with the registration of new domains, domain descriptions, and search services. This section is explained only briefly, because it does not directly concern the thesis project.

The domain framework deals with domains and their definitions, and addresses the problems of semantic annotation, storage, management, and access to domains and their descriptions. The whole multi-domain search engine is based on the concept of domain, considered as a self-standing field of interest for the user. The domain repository is a data structure that stores domains organized as a taxonomy, representing a tree of domain/sub-domain relationships. Information about the domains can be retrieved by other components through an API.
The search service framework defines a conceptual model of search services and addresses their semantic annotation, storage, management, and access. Its main function is to enable the annotation of the request/response interface of the services. Such annotation uses the WordNet vocabulary and adds labels to each service, its operations, and the input-output parameters of each operation. The framework is concerned only with those operations of a Web service which perform data retrieval, particularly operations which return itemized and ranked information.

The service analyzer addresses the following problems: the clustering of the available services based on their similarity, the mapping of services to domains, and the definition of join connections between services.
2.2.2 The query execution flow
The query execution flow addresses the problems of analyzing the user query, mapping it to domains and services, planning and executing the low-level queries, and transforming the results. Its main components, introduced in the previous section, are the query analysis, the query-to-domain mapper, the query planner, the query engine, and the results transformation; the following sections describe each of them in turn.
2.2.2.1 Query analysis
In this phase, high-level multi-domain user queries are analyzed and split into sub-queries. A high-level query is the specification of a user's information need at a high level of abstraction. It is assumed that high-level queries are quasi-natural-language descriptions of the user's request, which may require extracting information from multiple domains. The query analysis component decomposes the high-level queries into sub-queries, each representing one search objective in a specific domain. For processing the natural language query, an open source tool developed by the Stanford Natural Language Processing Group is used.
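As a toy illustration of this decomposition step, the sketch below splits a multi-domain query into candidate sub-queries at common coordination markers. This is not the thesis's approach, which relies on the Stanford parser's syntactic trees; the marker list and the example query are purely illustrative.

```python
# Toy first-level splitter: break a quasi-natural-language multi-domain query
# into candidate sub-queries at coordinating markers. Illustrative only; the
# real component splits on a parse tree produced by the Stanford parser.

SPLIT_MARKERS = (" and ", " then ", ", ")

def naive_split(query: str) -> list[str]:
    """Split a query at common coordination markers, keeping non-empty parts."""
    parts = [query]
    for marker in SPLIT_MARKERS:
        next_parts = []
        for part in parts:
            next_parts.extend(p.strip() for p in part.split(marker))
        parts = next_parts
    return [p for p in parts if p]

query = "find a cheap Chinese restaurant near piazza Duomo and a hotel close to it"
print(naive_split(query))
```

Each resulting fragment would then be treated as one search objective in a specific domain; a parse-tree-based splitter avoids the obvious failure modes of this surface-level heuristic (e.g. conjunctions inside a single objective).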
2.2.2.2 Query to domain and service mapping
This component addresses the problem of mapping sub-queries to domains, and subsequently to the associated search services, in order to define low-level queries. To successfully map a sub-query to a domain, we need to retrieve for each sub-query a defined subset of similar domains that allows a crisp identification of the sub-query's semantics, which, owing to the use of natural language, can be ambiguous and imprecise. Several techniques can be applied to optimize the recognition of query/sub-query structures that comply with the separation into distinct domains of concern; some of these methods are analyzed, in their meaning and implementation, in the next chapters.
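The mapping idea can be sketched as a scoring problem: each candidate domain is scored against the sub-query, and the highest-scoring domain wins. The sketch below scores by plain keyword overlap; the actual component uses WordNet Domains and several refinement techniques, and the domain lexicons here are invented for illustration.

```python
# Hedged sketch of sub-query -> domain mapping by keyword overlap.
# The domain names and keyword sets are illustrative assumptions, not the
# WordNet Domains hierarchy used by the real system.

DOMAIN_KEYWORDS = {
    "gastronomy": {"restaurant", "food", "dinner", "eat", "chinese"},
    "geography":  {"near", "place", "city", "milan"},
    "tourism":    {"hotel", "flight", "trip", "museum"},
}

def score_domains(sub_query: str) -> dict[str, int]:
    """Score each candidate domain by how many of its keywords appear in the sub-query."""
    tokens = set(sub_query.lower().split())
    return {d: len(tokens & kw) for d, kw in DOMAIN_KEYWORDS.items()}

def best_domain(sub_query: str) -> str:
    """Return the highest-scoring domain for a sub-query."""
    scores = score_domains(sub_query)
    return max(scores, key=scores.get)

print(best_domain("a cheap chinese restaurant near piazza duomo"))
```

Real natural language is ambiguous ("duomo" could suggest architecture as well as place), which is exactly why the thesis layers disambiguation and score-improvement methods on top of a basic scoring scheme like this.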
2.2.2.3 Query Planner
A low-level query is a composite query over a number of services. A query plan is a well-defined scheduling of service invocations, possibly parallelized, that complies with the services' access modes and exploits the ranking order in which search services return results to rank the overall results. The Query Planner addresses the problem of generating query plans and evaluating them against a cost metric, so as to choose the most promising one for execution. It accepts as input low-level queries, i.e. conjunctive queries that list the specific services to be invoked, already chosen by the Query-to-Domain Mapper. It then schedules the invocations of the Web services and the composition of their inputs and outputs. Finally, it progressively refines its choices and produces an access plan by performing the following steps:
1. Since services may be accessed according to different patterns, the Query Planner chooses, for each service involved in the query, a specific access pattern compatible with the query.

2. Once the access patterns are fixed, there may still be some indeterminacy in the order of invocation of the different services, some of which may be invoked in parallel. The Query Planner fixes this order.

3. The main operation for combining search services in our conjunctive setting is the join. The Query Planner selects an execution strategy for each join.

4. Optimality of execution depends primarily on the cost and time of request/responses to services. The Query Planner determines the expected number of requests associated with each service in order to obtain the desired number of results, so as to associate an execution cost with each plan.
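Step 4 can be illustrated with a minimal cost model: if each service has a per-request cost and an expected yield of results per request, the cost of a plan is the sum over its services of cost times the number of requests needed to reach the desired result count. The numbers and plan structures below are invented for illustration, not taken from the thesis.

```python
# Minimal sketch of cost-based plan selection: estimate the expected number of
# requests per service to obtain `wanted` results, sum the costs, pick the
# cheapest plan. Cost figures are illustrative assumptions.
import math

def plan_cost(services: list[tuple[float, float]], wanted: int) -> float:
    """services: list of (cost_per_request, results_per_request) pairs."""
    total = 0.0
    for cost_per_request, results_per_request in services:
        requests = math.ceil(wanted / results_per_request)
        total += cost_per_request * requests
    return total

plan_a = [(1.0, 10), (2.0, 5)]    # two services with different cost/yield
plan_b = [(0.5, 2), (2.0, 20)]
wanted = 20
name, _ = min([("A", plan_a), ("B", plan_b)], key=lambda p: plan_cost(p[1], wanted))
print(name, plan_cost(plan_a, wanted), plan_cost(plan_b, wanted))
```

A real planner must also weigh join strategies and parallelism (steps 2 and 3), but the principle is the same: attach a cost to each candidate plan and execute the cheapest.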
2.2.2.4 Query engine
The query engine deals with the generation and processing of query execution schedules: it takes the low-level plan from the query planner and executes the different service calls in parallel, merging and ordering the results when required. The results and the combinations returned are collected in their "raw" format of tuples of values as they become available, and passed to the Result Transformation module to be processed before being presented to the user.
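The engine's parallel-invocation-and-merge step can be sketched with standard concurrency primitives. The two services below are stand-in functions returning hard-coded ranked tuples; they are illustrative assumptions, not the SeCo service framework.

```python
# Sketch of parallel service invocation followed by a ranked merge.
# The "services" are stand-in functions returning (name, score) tuples.
from concurrent.futures import ThreadPoolExecutor

def restaurant_service():
    return [("Golden Dragon", 4.5), ("Jade Palace", 4.2)]

def hotel_service():
    return [("Hotel Duomo", 4.8), ("Albergo Verdi", 4.0)]

def execute_plan(services):
    """Call all services concurrently, then merge their tuples by score, descending."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(s) for s in services]
        results = [f.result() for f in futures]
    return sorted((row for rs in results for row in rs), key=lambda r: -r[1])

print(execute_plan([restaurant_service, hotel_service]))
```

In the real engine the merge is driven by the plan (join strategies, access modes) rather than a single global sort, but the shape is the same: independent calls run in parallel and their tuples are combined before transformation.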
2.2.2.5 Result transformation and Interfaces
This component is dedicated to defining proper interfaces for the submission of multi-domain user queries and to transforming the results into the format requested by the final user. It deals with building an interface that lets the user express multi-domain queries in a facilitated way, and an interface for presenting results. In the latter, the user can drill down into the result set and understand where each piece of information comes from, enabling query refinement, or can peruse the results of past queries to better reformulate his information need.
2.3 Service Marts
The Service Mart component is an abstraction used to manage the publication of, and access to, the data sources in the Search Computing architecture. The goal of a service mart is to ease the publication of a special class of software services, called search services, whose responses are ranked lists of objects. Every service mart is mapped to one "Web object" available on the Internet; therefore, we may have service marts for "hotels", "flights", "doctors", and so on. Service marts are thus consistent with a view of the "Internet of objects", which is gaining popularity as a new way to reinterpret concept organization on the Web and go beyond the unstructured organization of Web pages.
A Service Mart is a component with a known interface, defined at project time, which manages a collection of similar or semantically correlated services. The Service Mart can invoke these services, presenting itself as a standard interface between a query's request and its result. The underlying complexity can thus be hidden from the higher levels, and the result can be a completely relational model, simplified with respect to the original complexity of the web services model.
A Service Mart is defined by an Id, a Name, and a Description documenting its functionality. It is then divided into different levels of abstraction. The highest level is the Service Mart Signature, which contains a description of the service mart attributes (the sample input and output data that the Mart can handle) and of the repeating groups, non-empty sets of sub-attributes that collectively define a property of the service mart. At the level below are the Access Patterns. Their structure is analogous to the Signature, and each specifies a further possible invocation mode; every parameter in an Access Pattern is characterized by a data type, a "mandatory" flag, and a direction (input or output). At the third and lowest level are the Service Interfaces. A Service Interface is a concrete description of an access pattern: it has an interface with its attributes and is linked to a service implementation, the real link to the web service (to retrieve data from local or remote sources).
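The three description levels can be sketched as plain data structures. The field names follow the text (signature attributes; access-pattern parameters with data type, mandatory flag, and direction; service interfaces linked to an implementation); the classes themselves and the "Hotel" example are illustrative, not the SeCo implementation.

```python
# Illustrative data model for the Service Mart description levels.
from dataclasses import dataclass, field

@dataclass
class Parameter:
    name: str
    data_type: str
    mandatory: bool
    direction: str  # "input" or "output"

@dataclass
class AccessPattern:
    name: str
    parameters: list[Parameter]

@dataclass
class ServiceInterface:
    name: str
    implementation_url: str  # the concrete link to the web service

@dataclass
class ServiceMart:
    id: int
    name: str
    description: str
    signature_attributes: list[str]
    access_patterns: list[AccessPattern] = field(default_factory=list)
    interfaces: list[ServiceInterface] = field(default_factory=list)

hotel = ServiceMart(
    id=1, name="Hotel", description="Searches ranked hotel offers",
    signature_attributes=["city", "name", "price"],
    access_patterns=[AccessPattern("by_city", [
        Parameter("city", "string", True, "input"),
        Parameter("price", "float", False, "output"),
    ])],
)
print(hotel.name, hotel.access_patterns[0].parameters[0].name)
```

Modelled this way, a query can be matched first at the mart level (which concept?) and then at the access-pattern level (which invocation mode fits the available inputs?), which is exactly the two-step matching discussed in Chapter 4.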
Connection patterns represent the coupling of service marts (at the conceptual level) and of service interfaces (at the physical level). Each pattern has a conceptual name and a logical specification, consisting of a sequence of simple comparison predicates between pairs of attributes or sub-attributes of the two services. These predicates are interpreted as a conjunctive Boolean expression, and can therefore be implemented by joining the results returned by calling the service implementations.
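Concretely, a single-predicate connection pattern amounts to an equi-join over the two services' results. The data and attribute names below are invented for illustration.

```python
# Sketch of a connection pattern executed as a join: an equality predicate
# between shared attributes of two result sets. Data is illustrative.

hotels = [
    {"name": "Hotel Duomo", "city": "Milan"},
    {"name": "Hotel Mare", "city": "Genoa"},
]
restaurants = [
    {"name": "Jade Palace", "city": "Milan"},
    {"name": "Trattoria Gino", "city": "Rome"},
]

def join_on(left, right, attr):
    """Join two result lists on equality of a shared attribute."""
    return [(l, r) for l in left for r in right if l[attr] == r[attr]]

pairs = join_on(hotels, restaurants, "city")
print([(h["name"], r["name"]) for h, r in pairs])
```

A pattern with several predicates would simply conjoin the comparisons inside the filter, matching the "conjunctive Boolean expression" reading above.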
Visually, service marts and connection patterns can be presented as resource graphs, where nodes represent marts and undirected arcs represent connection patterns. The model of the web proposed by Search Computing is based on a simplification of reality, seen through potentially very large resource graphs. This visualization links interconnected concepts, supporting the creation of multi-domain queries through ad-hoc user interfaces.
2.4 The Natural Language Framework
The natural language processing framework used as a starting point for the thesis was the fruit of a two-phase research effort, as illustrated in the figure. The main goal of the framework's design was to assemble a complete corpus of queries and analyze them efficiently, so that the retrieved data could become the starting point for testing the SeCo search engine.
Figure 2.2: The research flows for the Natural Language Framework
The aim of the framework is to create an environment to analyze input queries and
extract information about their characteristics and domains; this information is
subsequently used to elaborate a suitable matching with the corresponding search
services.
In the first phase, a corpus that responds to the needs of the project, that is, one
assembling as many multi-domain queries as possible, was created from scratch using
publicly available data. This data was acquired from the publicly available service
Yahoo! Answers, specifically from the tourism question section. This choice
was driven by the fact that in that section it is very likely to find multi-domain
requests, due to the multifaceted subject. In the second phase, a smaller but more
interesting subset of this larger corpus was taken and analyzed in depth. This
analysis has two aspects. The first is the splitting of a question into the diverse
domains that constitute it, extracting the important objects from those parts. The
second is the association of the resulting objects with one or more semantic domains
of knowledge that will be mapped to the corresponding services.
All these analyses are carried out in a web application environment called Sift.
More details are given in the implementation section.
2.5 WordNet
WordNet is a lexical database for the English language that aims at organizing,
defining and describing concepts through a semantic network. The lexicon is
organized by grouping terms with similar meanings into sets called synsets and by
linking their meanings through a number of different relations. The latest available
version of the database (WordNet 3.0) contains more than 150,000 terms organized
in 117,659 synsets. Moreover, given WordNet's success, many lexical networks have
been developed to link WordNet terms to other languages, as a multilingual search
support. Semantic relations available in WordNet are categorized according to the
terms that take part in each specific relation. Among nouns, the principal semantic
relations defined are: hypernyms, hyponyms, holonyms, meronyms. Relations are
also defined among verbs and adjectives.
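The synset-and-relation structure can be pictured with a toy fragment in Ruby (the three synsets and their hypernym links are invented for illustration; the real WordNet data is far richer):

```ruby
# A synset groups terms with the same meaning; hypernym points to the
# more general synset, forming an is-a chain up to the root.
Synset = Struct.new(:terms, :hypernym)

entity = Synset.new(["entity"], nil)
animal = Synset.new(["animal", "creature"], entity)
dog    = Synset.new(["dog", "domestic dog"], animal)

# walk the hypernym chain of a synset up to the root
def hypernym_chain(synset)
  chain = []
  chain << synset.terms.first while (synset = synset.hypernym)
  chain
end

hypernym_chain(dog)  # => ["animal", "entity"]
```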
2.6 WordNet Domains
WordNet Domains is a lexical resource, which can be considered an extension of
WordNet, in which synsets have been annotated semi-automatically with one or more
domain labels. A domain may include synsets of different syntactic categories and
from different WordNet sub-hierarchies [1].
WordNet Domains contains 200 domain labels in a hierarchical structure (the
WordNet Domains Hierarchy), organized as in the Dewey Decimal Classification
(DDC), a general knowledge organization tool which is the most widely used
taxonomy for library organization purposes. Each synset of WordNet 2.0 was
labeled with one or more labels, using a methodology that combines manual and
automatic assignments.
The whole infrastructure of the multi-domain search engine is based on the concept
of domain. A domain is considered a self-standing field of interest for the user,
such as music, sport, arts, tourism, computer science, and so on. The annotation of
every synset in WordNet Domains makes it possible to characterize a domain in
terms of the terms most frequently used for describing concepts in that domain and,
vice versa, to identify for each synset the list of domains it refers to. One of the
most interesting and urgent tasks in Search Computing was to investigate whether
WordNet Domains can facilitate the task of partitioning queries and associating
them with specific search engines and data sources. The domain repository is a
data structure able to store domains as described above. In this solution, we assume
that domains are organized as a taxonomy, representing a tree of domain/sub-domain
relationships. Information about the domains is made available to the other
components through an API that exposes interfaces for querying and updating the
domain structure (i.e., creation, deletion, and update of domain information,
including associated synsets and services).
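A minimal sketch of such a repository in Ruby (method names and the example domains are assumptions, not the project's actual API) could look like this:

```ruby
# Domains form a tree of domain/sub-domain relationships; each node
# carries its associated synsets and services.
class DomainRepository
  Domain = Struct.new(:name, :synsets, :services, :children)

  def initialize
    @root = Domain.new("root", [], [], {})
  end

  # create (or extend) the domain at the given path in the taxonomy
  def create(path, synsets: [], services: [])
    node = @root
    path.each { |name| node = (node.children[name] ||= Domain.new(name, [], [], {})) }
    node.synsets.concat(synsets)
    node.services.concat(services)
    node
  end

  def find(path)
    path.reduce(@root) { |node, name| node && node.children[name] }
  end
end

repo = DomainRepository.new
repo.create(%w[tourism hotel], synsets: ["hotel.n.01"], services: ["HotelSearch"])
repo.find(%w[tourism hotel]).services  # => ["HotelSearch"]
```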
2.7 Stanford Parser
The Stanford Parser is a Natural Language Processing suite of tools and libraries
that can be used in various tasks related to natural language analysis. In the context
of this research it is used for its parsing abilities: it implements a probabilistic
parser of English natural language sentences, packaged as a Java library accompanied
by a dictionary file used as training data. The outcome of the parser is a tree
representation of the sentence, which is suitable for the problem of splitting queries
into sub-queries to be assigned to different domains.
Probabilistic parsing uses dynamic programming algorithms to compute the most
likely parse(s) of a given sentence, given a statistical model of the syntactic
structure of a language. Models have been developed for parsing several languages:
English (used for this research), Chinese, Arabic, and German.
The very detailed parse of a sentence or period makes it possible to try a lot of
different approaches to the splitting and analysis of natural language. In this
framework two main approaches have been researched: first-level splitting and
clause-level splitting. The research and result details about these approaches are
examined in the next chapters.
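To give a flavor of clause-level splitting, the simplified Ruby sketch below (not the framework's actual algorithm) represents a parse tree as nested arrays and extracts the lowest clause subtrees, each of which becomes a candidate sub-query:

```ruby
# A parse tree is a nested array [label, child, ...]; leaves are words.
def leaves(tree)
  tree.is_a?(String) ? [tree] : tree.drop(1).flat_map { |c| leaves(c) }
end

CLAUSE_LABELS = %w[S SBAR].freeze

# Collect the lowest clause subtrees: a clause containing no further
# clause yields one sub-query; otherwise we keep descending.
def clauses(tree)
  return [] if tree.is_a?(String)
  label, *children = tree
  inner = children.flat_map { |c| clauses(c) }
  return inner unless CLAUSE_LABELS.include?(label) && inner.empty?
  [leaves(tree).join(" ")]
end

tree = ["ROOT",
        ["S",
         ["S", ["NP", "I"], ["VP", "want", ["NP", "a", "hotel"]]],
         ["SBAR", "that", ["S", ["VP", "has", ["NP", "a", "spa"]]]]]]
clauses(tree)  # => ["I want a hotel", "has a spa"]
```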
2.8 Named Entity Recognition
The tool we used for Named Entity Recognition (NER) is the CRF (Conditional
Random Field)-based NER system developed by the Stanford NLP Group [2].
Named entity recognition (also known as entity identification or entity extraction)
is a subtask of information extraction that seeks to locate and classify atomic
elements in a text into predefined categories such as the names of persons,
organizations, locations, etc.
Given a text as input, the NER system produces a parsed output that highlights the
entities found in the document.
In particular, the Stanford system can recognize a great number of persons (famous
people or proper names), organizations (companies, government organizations,
committees, etc.), locations (cities, countries, rivers, etc.) and other miscellaneous
entities. This system is trained on the CoNLL-2003 [3] named entity data, which
consists of eight files covering two languages: English and German. The English
data was taken from the Reuters Corpus, which consists of Reuters news stories
published between August 1996 and August 1997.
The CoNLL-2003 data files contain four columns separated by a single space.
Each word is put on a separate line and there is an empty line after each sentence.
The first item on each line is a word, the second a part-of-speech (POS) tag, the
third a syntactic chunk tag and the fourth the named entity tag. The chunk tags
and the named entity tags have the format I-TYPE, which means that the word is
inside a phrase of type TYPE. Only if two phrases of the same type immediately
follow each other does the first word of the second phrase receive the tag B-TYPE,
to show that it starts a new phrase. A word with tag O is not part of a phrase.
Here is an example:
U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
The data consists of three files per language: one training file and two test files
“Test A” and “Test B”. The first test file is used in the development phase for
finding good parameters for the learning system. The second test file is used for
the final evaluation.
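Reading the four-column format is straightforward; this Ruby sketch groups consecutive I-TYPE tokens of the same type into entity spans:

```ruby
# Parse CoNLL-2003 lines (word, POS, chunk, NER tag) and collect
# entity spans from the fourth column.
def extract_entities(conll)
  entities = []
  current = nil
  conll.each_line do |line|
    word, _pos, _chunk, ner = line.split
    next if ner.nil?
    if ner == "O"
      current = nil
    elsif current && ner == "I-#{current[:type]}"
      current[:text] << " " << word        # continue the current span
    else
      current = { type: ner.sub(/^[IB]-/, ""), text: word }
      entities << current
    end
  end
  entities
end

sample = <<~CONLL
  U.N. NNP I-NP I-ORG
  official NN I-NP O
  Ekeus NNP I-NP I-PER
  heads VBZ I-VP O
  for IN I-PP O
  Baghdad NNP I-NP I-LOC
CONLL

extract_entities(sample)
# => three spans: ORG "U.N.", PER "Ekeus", LOC "Baghdad"
```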
2.9 Technologies
Following is an overview of the main technologies and tools used to implement
the framework.
JavaScript Object Notation JavaScript Object Notation, or JSON, is a lightweight
data-interchange format similar in purpose to XML. It is a text-based, human-readable
format for representing simple data structures and associative arrays (called objects).
It is based on the JavaScript syntax for describing data structures and supports the
data structures most commonly used in high-level languages. It was chosen over other
data exchange formats such as XML for its simplicity and readability. Its ease of
mapping to the data types provided by most languages makes it very natural to convert
back and forth, and it is also supported across a multitude of languages and
frameworks, with libraries implemented in every popular high-level language.
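The round trip between Ruby values and JSON, which made the format so convenient in this project, is a one-liner in each direction (the field names are invented):

```ruby
require "json"

# a query record as a plain Ruby hash
record  = { "query" => "thai restaurant in LA", "domains" => ["food", "tourism"] }
encoded = JSON.generate(record)  # serialize to a JSON string
decoded = JSON.parse(encoded)    # back to a Ruby hash, equal to the original
```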
CouchDB CouchDB is an Apache Foundation project for a document-based
database server written in Erlang, a highly efficient language for concurrent and
distributed applications. It diverges from the model of relational databases in many
ways and offers a very different performance profile. CouchDB stores free-form
documents instead of the records seen in a regular relational database. Its schemas
are flexible, and the elements can change from one document to another within the
same database. This can be useful in many applications, such as ones where schemas
are highly likely to change over time, or in situations where the rows are very
sparse, that is, where many fields exist but only a few are actually used in a single
document. The server is accessible via a RESTful JSON API. JSON is its native
data format, which makes it very flexible in terms of what data types can be stored.
It also supports computed views, which replace indices and are written in JavaScript
by the user. These views follow the Map/Reduce paradigm, where a first function
(map) is tasked with going over every document, emitting key/value pairs in which
both parts can be any given JSON element.
The second function (reduce) then sorts and groups elements by their keys, and
transforms and reduces the array of values associated with each key into a single
atomic element. The contract is that the computation of one element is totally
independent from the computation of any other, allowing the system to distribute
the work, cache it aggressively and reorder it as needed to improve performance.
CouchDB also supports keeping multiple revisions of a single document, allowing
the user to request a particular version. This also makes it possible to offer
optimistic conflict resolution for updates, where, during an update operation, the
sender is required to state which version its change is based on. If that version
corresponds to the most up-to-date one, the update is applied without any trouble.
Otherwise, if another user has already updated the same document, an error message
is sent to the user, who is then given the opportunity to rebase on the latest version.
Another interesting feature of CouchDB is its core support for master-master
replication, where two nodes can be synchronized and both can still act as master,
unlike the normal master-slave model where slaves are only used for read operations
while the master is the unique point of update. CouchDB was chosen within the
context of this project with the idea that the schema was most likely to change
greatly over the course of the research, and that the objects we would need to store
would not fit a relational database very well.
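The map/reduce contract of CouchDB views can be mimicked in a few lines of plain Ruby (a toy in-memory simulation, not CouchDB's actual JavaScript machinery):

```ruby
# map emits key/value pairs per document; reduce collapses the values
# grouped under each key into a single element.
def run_view(docs, map, reduce)
  emitted = Hash.new { |h, k| h[k] = [] }
  emit = ->(key, value) { emitted[key] << value }
  docs.each { |doc| map.call(doc, emit) }
  emitted.transform_values { |values| reduce.call(values) }
end

docs = [{ "domain" => "tourism" }, { "domain" => "food" }, { "domain" => "tourism" }]

# count documents per domain
counts = run_view(docs,
                  ->(doc, emit) { emit.call(doc["domain"], 1) },
                  ->(values) { values.sum })
# => {"tourism"=>2, "food"=>1}
```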
Ruby Ruby is a high-level programming language known for being highly dynamic
and flexible with regard to its syntax. Ruby supports multiple programming
paradigms, including functional, object-oriented, imperative and reflective. It also
has a dynamic type system and automatic memory management. While its
implementation is slower than that of other languages, it has become famous for
allowing the creation of DSLs (Domain-Specific Languages), where the host language
itself is adapted in order to create a more natural syntax suited to the task at hand.
In particular, it has become famous for its use in the Web domain, where it now
sports a host of libraries for quickly and efficiently creating web applications. It is
a pure object-oriented language, where every method or function is activated by
sending a message to the desired instance. Every element
in the code can be considered an object, even literal strings and numbers. In this it
follows the tradition of the Smalltalk family of languages. Ruby also allows the
re-opening and modification of already-defined classes, even those that are part of
the core and standard library.
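Open classes are easy to demonstrate; the following (contrived) snippet reopens the core Integer class and adds a method to it:

```ruby
# Reopening a core class: every Integer instance, including literals,
# gains the new method.
class Integer
  def squared
    self * self
  end
end

7.squared        # => 49
7.squared.class  # => Integer
```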
Sinatra Sinatra is a Domain-Specific Language (DSL) for quickly creating
web applications in Ruby. It is extremely simple while still keeping most of the
power of other frameworks, and this simplicity also offers great flexibility. It is not
a typical Model-View-Controller framework: it ties specific URLs directly to
relevant Ruby code and returns that code's output in response. It does enable you,
however, to write clean, properly organized applications, separating views from
application code, for instance. Any given operation can be performed within those
blocks of code, and the only contract is that they are expected to return a string of
characters that will be sent to the user. This string can be generated directly or,
preferably, created by rendering a specified template that abstracts away the view
part. Sinatra itself can be run on a number of application servers, ranging from
small, focused ones such as Thin to general web servers such as Apache.
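To illustrate the contract, here is a toy dispatcher in plain Ruby, not Sinatra itself: a path is tied directly to a block, and whatever string the block returns becomes the response body.

```ruby
ROUTES = {}

# tie a path directly to a block of Ruby code, Sinatra-style
def get(path, &handler)
  ROUTES[path] = handler
end

# look up the block for a path and use its return value as the body
def dispatch(path)
  handler = ROUTES[path]
  handler ? handler.call : "404 Not Found"
end

get("/hello") { "Hello, World" }

dispatch("/hello")    # => "Hello, World"
dispatch("/missing")  # => "404 Not Found"
```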
HTTParty While fetching and parsing data from an external web service can be
done using low-level libraries, HTTParty makes it much easier. One simply has to
specify the URL of the service as well as optional parameters, such as the developer
key in the case of Yahoo! Answers, and HTTParty takes care of fetching the data
and translating it into a native format of its host language, Ruby.
Scala Scala is a high-level language closely based on Java, but taking ideas from
functional languages, such as Haskell or ML, and others from dynamic languages
like Ruby. It was created in Switzerland in a project led by Martin Odersky, a lead
designer of the Java language itself. Scala offers quasi-total compatibility with
Java, being able to import and export libraries compiled in any language running
on the Java Virtual Machine. In addition, while it sports a Java-like syntax, it
supports type inference, allowing users to skip explicitly
defining the type of each variable. It also supports higher-order functions, pattern
matching and an evolution of interfaces and abstract classes called traits, inspired
by Ruby mixins. Among its remarkable features is a library that offers a new
perspective on concurrent systems, called actors. This feature, taken from languages
such as Erlang and Smalltalk, allows a developer to conceptualize systems as a
series of independent processes called actors, which can communicate through the
use of referentially transparent messages. Actors are implemented using a mailbox,
in effect a queue where messages are stored. The actor can then define its act
method to handle these messages, often using pattern matching to dispatch on the
type of the message, which can be arbitrary. Scala was primarily chosen because it
offers access to the wide library of Java applications. It was also chosen over Java
itself because it is more suited to exploratory programming, where one does not
know exactly the shape the result will take, as was the case at the beginning of this
project.
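Since the framework's other language is Ruby, the actor idea can be sketched there as well: each actor owns a mailbox queue and a handler that consumes messages one at a time (a deliberate simplification of Scala's actual actor library).

```ruby
class Actor
  def initialize(&act)
    @mailbox = Thread::Queue.new            # messages wait here
    Thread.new { loop { act.call(@mailbox.pop) } }
  end

  # sending is asynchronous: the message is only enqueued
  def send_message(msg)
    @mailbox << msg
  end
end

replies = Thread::Queue.new
doubler = Actor.new { |msg| replies << msg * 2 }
doubler.send_message(21)
replies.pop  # => 42 (pop blocks until the actor has replied)
```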
Kestrel Kestrel is a queuing service we use to distribute work tasks among the
workers, and to send them from the server to where the manager of the workers can
reach them. While it is quite new, it has proven its worth through use at Twitter Inc.,
where it powers much of that hugely popular communication service. The
particularity of this service is that it complies with the Memcached protocol.
Memcached is the most widely used service for storing transient data and is used as
a cache to avoid repeating costly operations. While Kestrel changes the semantics
of this protocol, the fact that it respects the simple get and set contract of
Memcached allows the use of a great number of libraries that, while originally
written for Memcached itself, can now be used transparently to send tasks to the
Kestrel server. Its basic semantics are that a set operation associates a key with a
queue, and the payload given within the operation is added to the end of that named
queue. The get operation instead takes the first element from that same named
queue, or returns a special message if no element can be found. Kestrel itself is
implemented as a daemon in Scala, a high-level language that takes most of its
inspiration from Java and is in fact compiled to Java bytecode, allowing it to run
seamlessly in the Java Virtual Machine.
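The get/set queue semantics just described can be captured by a tiny in-memory stand-in (the class below is an illustration, not a Kestrel client):

```ruby
class FakeKestrel
  EMPTY = :empty   # marker returned when a queue has no elements

  def initialize
    @queues = Hash.new { |h, k| h[k] = [] }
  end

  # `set` appends the payload to the end of the named queue
  def set(queue_name, payload)
    @queues[queue_name] << payload
  end

  # `get` takes the first element, or the marker if none is available
  def get(queue_name)
    @queues[queue_name].empty? ? EMPTY : @queues[queue_name].shift
  end
end

kestrel = FakeKestrel.new
kestrel.set("tasks", "analyze query 1")
kestrel.set("tasks", "analyze query 2")
kestrel.get("tasks")  # => "analyze query 1" (FIFO order)
```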
The Google Chart API To build the pie chart graphics in the statistics section
we used the Google Chart tool. This tool provides a free service to dynamically
render image charts through a simple URL request to a Google server. The URL
requests are simple to build and very useful for embedding graphical elements in a
web page.
Many chart types are available. The chart type is specified by the cht parameter
and the data by the chd parameter. It is then possible to set the format to use for
the data (such as simple text format, or one of the encoding types) and to specify
the chart size with the chs parameter. Additional parameters can be added; each
chart's documentation lists the available ones, which include labels, titles, and
colors.
A sample URL starts with http://chart.apis.google.com/chart? and is followed by
all required and optional parameters. Example:
http://chart.apis.google.com/chart?chs=250x100&chd=t:60,40&cht=p3&chl=Hello|World
Figure 2.3: The sample pie chart
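Assembling such a URL from its parameters is a matter of string building; the Ruby sketch below reproduces the sample URL above:

```ruby
BASE = "http://chart.apis.google.com/chart"

# join parameter/value pairs into the chart request URL
def chart_url(params)
  "#{BASE}?#{params.map { |k, v| "#{k}=#{v}" }.join("&")}"
end

url = chart_url(chs: "250x100",       # chart size
                chd: "t:60,40",       # data in simple text format
                cht: "p3",            # 3D pie chart
                chl: "Hello|World")   # slice labels
# => "http://chart.apis.google.com/chart?chs=250x100&chd=t:60,40&cht=p3&chl=Hello|World"
```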
Chapter 3
Related Work
A great deal of work has been done in the past in the information retrieval field
using the tools presented in the previous chapter. We explain some of these
approaches in the following sections.
3.1 WordNet
WordNet has been used, since its creation in the '90s, in numerous natural language
processing tasks, such as part-of-speech tagging, word sense disambiguation, text
categorization and information extraction, with considerable success. The usefulness
of WordNet in information retrieval applications, however, has been controversial.
Information retrieval is the process of locating documents relevant to a user's
information needs in a collection of different sources. The user describes his or her
information needs with a query consisting of a number of words. The information
retrieval system compares the query with the documents in the collection and
returns the ones that are likely to satisfy the user's information requirements. The
main weakness of this process is that the vocabulary searchers use is often not the
same as the one with which the information has been indexed. One method to
address this problem is query expansion [4]. The queries are expanded with terms
that have similar meaning or bear some relation to those in the query, increasing
the chances of matching words in relevant documents. Expanded terms are generally
taken from a thesaurus. Even with query expansion methods, however, no fully
satisfactory results were achieved, mainly because of some practical limitations of
WordNet as a tool:
• Two terms that are clearly interrelated may have different parts of speech in
WordNet. This is the case for stochastic (adjective) and statistic (noun).
Since words in WordNet are grouped on the basis of part of speech, it is not
possible to find a relationship between terms with different parts of speech.
• Many relationships between two terms are simply not found in WordNet. For
example, how do we know that Mizuho Bank is a Japanese company?
• Some terms are not included in WordNet at all (proper names, locations, etc.).
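The expansion idea itself is simple; here is a toy version with an invented two-entry thesaurus:

```ruby
# each query term is expanded with related terms before matching
THESAURUS = {
  "film"  => ["movie", "picture"],
  "hotel" => ["inn", "lodging"]
}.freeze

def expand(query)
  query.split.flat_map { |term| [term] + THESAURUS.fetch(term, []) }
end

expand("hotel near film theater")
# => ["hotel", "inn", "lodging", "near", "film", "movie", "picture", "theater"]
```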
3.2 WordNet Domains
This tool has been used mainly in the field of word sense disambiguation. The
underlying hypothesis is that domain labels, such as Medicine, Architecture and
Sport, provide a useful way to establish semantic relations among word senses,
which can be profitably used during the disambiguation process. One of the first
approaches to word domain disambiguation through WordNet Domains was [5],
where words in a text are tagged with a domain label in place of a sense label taken
from the classic WordNet dictionary. They adopted frequency measures, based
respectively on the intra-text frequency and the intra-word frequency of a domain
label.
In [6], Domain Relevance Estimation (DRE) is presented. Given a certain domain,
DRE distinguishes between relevant and non-relevant texts by means of a Gaussian
Mixture model that describes the frequency distribution of domain words inside a
large-scale corpus; DRE is a fully unsupervised text categorization
technique. The correct identification of the domain of a text is a crucial point for
Domain Driven Disambiguation. Studies on the relevance of a text in the domain
context have been exploited by approaches like [7], where an approach based on
word sense disambiguation is presented. Using WordNet Domains and retrieving
the domains available for each synset of a word, it is possible, through different
approaches using distance vectors, to calculate the most representative domain. It
is in fact assumed that domains constitute a fundamental semantic property on
which textual coherence is based, such that word senses occurring in a coherent
portion of text tend to maximize their belonging to the same domain.
All these approaches achieve good results but use only domain information. To
improve the system's recall, other information should be integrated into the
domain-based approach, for example supervised approaches that make use of local
information, such as word collocations and grammatical context.
3.3 Named Entity Recognizer
A lot of work has been done in the past using entity recognizers as a kind of
intelligent parser. They can recognize named entities more precisely than, for
instance, regular expressions, which can only identify a proper name but cannot
classify its meaning. NER systems are thus commonly used as information
extractors on formal text such as news articles and websites, as geographical
information extractors [9], or as personal information extractors from emails [8].
Their information retrieval function is very specialized with respect to the kind of
material under examination: blogs, news and emails have different structures and
kinds of data that require differently characterized NER systems to return good
results. Moreover, many of these recognizers are computationally heavy because of
the large sets of training data they have to handle. The optimal solution to obtain
the greatest number of entities from a document would be to combine an
environment-optimized NER system with a set of regular expressions.
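Such a combination can be sketched as follows; the gazetteer standing in for a real NER system and the date pattern are, of course, invented for the example:

```ruby
# stand-in for a statistical NER system: a tiny gazetteer lookup
def stub_ner(text)
  known = { "Baghdad" => "LOCATION", "Ekeus" => "PERSON" }
  text.split.map { |w| [w, known[w]] if known[w] }.compact
end

# regular expressions catch rigid patterns a recognizer may miss
def regex_entities(text)
  text.scan(/\d{4}-\d{2}-\d{2}/).map { |date| [date, "DATE"] }
end

text = "Ekeus heads for Baghdad on 1997-08-02"
stub_ner(text) + regex_entities(text)
# => [["Ekeus", "PERSON"], ["Baghdad", "LOCATION"], ["1997-08-02", "DATE"]]
```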
3.4 Query splitting
The topic of query splitting, or query segmentation, has been analyzed in many
papers, and very different approaches have been tested. The one examined in [11]
is based on retrieved results. The aim of this approach is to find interesting
documents that link two queries, functioning as “stepping stones”. This way of
proceeding is particularly useful in the field of academic and scientific articles.
The two queries can be provided by the user himself, or they can be identified by
the system through the examination of the single query provided; this is done with
an unsupervised method that analyzes the various documents retrieved for the
query and groups them according to common terms and characteristics.
In [12] an unsupervised approach is proposed, based on a query word-frequency
matrix derived from web statistics. They first adopt the N-Gram model to estimate
the query terms' frequency matrix based on word occurrence statistics on the web.
They then devise a strategy to select the principal eigenvectors of the matrix.
Finally, they calculate the similarity of query words for segmentation.
In [13] a generative query model is used to recover the query's underlying concepts
that compose its original segmented form. The model's parameters are estimated
using an expectation-maximization (EM) algorithm, optimizing the minimum
description length objective function on a partial corpus that is specific to the query.
To augment this unsupervised learning, they incorporate evidence from Wikipedia,
exploiting external knowledge to make sure that the output segments are well-formed
concepts, not just frequent patterns.
Most effective approaches to query splitting use unsupervised methods. There are
some natural-language-based query analysis research efforts, but they are often
very structured or restricted to specific domains, acting more like natural language
interfaces to databases than natural language analyzers.
3.5 Matching
The query matching subject has not been approached widely, but a significant
piece of research can be found in [14], where a generic query is routed to a proper
search service after an analysis by the automated query routing system Q-Pilot.
Off-line, Q-Pilot takes as input a set of search engines' URLs and creates, for each
engine, an approximate textual model of that engine's content or scope, something
conceptually similar to SeCo's semantic annotation for Service Marts. On-line,
Q-Pilot takes a user query as input, applies a query expansion technique to the
query and then clusters the output of query expansion to suggest multiple topics
that the user may be interested in investigating. Each topic is associated with a set
of search engines, for the query to be routed to, and a phrase that characterizes the
topic. For example, for the query “Python”, Q-Pilot enables the user to choose
between movie-related search engines under the heading “movie — monty python”
and software-oriented resources under the headings “object-oriented programming
in python” and “jpython — python in Java”. An important key point in the Q-Pilot
design is the use of neighborhood-based identification of search engines' topics in
combination with query expansion. This approach gives quite good results, as
reported in the article; query expansion fills the gap between the short query and
the small number of terms in search engines' topics. This system, though quite
efficient, is well suited only to very short, single-domain queries.
Chapter 4
The Thesis Project Contribution
4.1 Objective
Complex queries make it possible to extract answers from complex data, rather
than from within a single Web page; but complex data require a data integration
process. In the SeCo project this process is query-specific, because answering
queries about very different topics requires intrinsically different data sources.
However, data integration is one of the hardest problems in computing, because it
requires full understanding of the semantics of data sources; as such, it cannot be
done without human intervention. A data source is any data collection accessible
on the Web. The Search Computing motto is that each data source should be
focused on its single domain of expertise (e.g., travel, music, shows, food, movies,
health, genetic diseases), but pairs of data sources which share information can be
linked to each other to build complex results. This classification of the data into
different domain groups, represented by the service marts, is the basis for the
upper-level query elaboration that tries to match the input with the available data
sources.
In fact, the main objective of the thesis project is to enhance the existing natural
language analyzer framework and add a service mart matching function to match
the high-level query to the services for a future service invocation. In particular,
we aim to map user-specified queries with no fixed input forms onto the SeCo
multi-domain paradigm; this new feature will make it possible to test the output of
the natural language elaboration in the framework.
4.2 Hypothesis
Syntactical hypothesis: Clause-level splitting divides the query into clauses, as
the name says; therefore, to achieve a good splitting result, each clause should be
dedicated to a well-defined section of the request. By this we mean that the part of
a question regarding a specific field or domain should be confined to a single clause,
so that it can be satisfied by a web service, or by a group of semantically related
services.
Wrong example: “Is there a restaurant in Los Angeles in the proximity of a musical
theater? I'd prefer Thai food”
Figure 4.1: The trees retrieved from the analysis of the query
This request is formally wrong, considering our syntactical clause division, because
the last clause “I'd prefer Thai food” would be completely unrelated to the first one,
where other requests about the same domain are made. As visible in the
8/2/2019 Matching Natural Language Multi Domain Queries to Search Service
http://slidepdf.com/reader/full/matching-natural-language-multi-domain-queries-to-search-service 38/96
CHAPTER 4. THE THESIS PROJECT CONTRIBUTION 36
figure above two trees are retrieved because the parser considers the question fin-
ished when it finds the question mark symbol “?” and therefore it begins a new
tree for the following sentence
Correct Example: “Is there a Thai food restaurant in Los Angeles which is in the
proximity of a musical theatre?”
Figure 4.2: The tree for the correct example
Another downside of this kind of splitting is the misinterpretation of requests
linked by conjunctions. Because of the nature of the parser, only clauses (relative
clauses, subordinate clauses, or sentences) are recognized, so a request given as a list
of conditions linked only by conjunctions and no verbs is identified as a single
long sentence.
Wrong example: “I’d like a Thai Restaurant in LA, near a movie theater that
shows a horror movie and a hotel to spend the night with a spa.”
This example will be parsed into “I’d like a Thai Restaurant in LA, near a movie
theater that”, “shows” and “a horror movie and a hotel to spend the night with
a spa”. This splitting would be completely wrong in terms of domain division,
because the information on the movie should be linked to the previous part and
the information about the hotel should be placed in an independent sentence. The right
structure for this request would be:
“I’d like a Thai Restaurant in LA. A nearby movie theater that shows a horror movie. I
would like to spend the night in a hotel nearby.”
An easy way to overcome these splitting problems is to put a full stop after any
part of the request that concerns a specific subject. This limits the spreading
of the concepts in the sentence and keeps them all in a single sub-query. The
main problem with this approach is the loss of connection information among the
requests, which could otherwise be retrieved by deeply analyzing the links between the clauses.
Example: “I’d like a Hotel at the Bahamas where I can go snorkeling.”
The conjunction “where” indicates that the user wants to find the activity requested
in the second sentence in the place named in the first one. If we split the
sentence and transform the second part into an independent one, as in:
“I’d like a Hotel at the Bahamas. I want to go snorkeling there.”
the recognition of the link between the two sentences is no longer immediate
and requires a deeper analysis to be detected.
Service Marts’ Hypothesis: The complex structure of service marts explained
in Chapter 2 has many characteristics that do not specifically concern the
query/service mart matching function. We therefore decided to simplify the model to a
“semantic” version of it.
Below is the UML diagram used to design the objects used in the thesis application:
Figure 4.3: The semantic modelization of the Service Mart
As you can see, we only kept the elements with a semantic value, so we omitted the
representation of repeating groups, because they only carry a structural value.
We also hypothesized that we had only semantic attributes and no quality indicators
such as ranking. This feature is not yet retrievable from the queries through the
framework we built, so we assumed that for now it could be treated automatically
(as an intrinsic property of the order of the results) or parametrically (the user is given
the opportunity to decide about it).
Another feature that has not been considered is the join relation between different
service marts or access patterns, even though it is a very important one in the SeCo
architecture. The possibility to link different search services with join paths gives
the power to answer the greatest part of multi-domain queries; in fact, it is assumed
that a user will not repeat every bit of data in the request as many times as
the splitting into domains would require. The repetition of the linking parameter through
a join path is vital in these cases. With our implementation we can match only a
smaller range of multi-domain queries to service marts.
A third simplification concerns Service Attributes, which are presented as the semantic
version of Service Interfaces.
4.3 Query Analysis
From the corpus of entries retrieved, we proceeded to create techniques
that allow us to recognize and extract the domains from a question. This procedure
spanned several phases, each requiring different steps: the elaboration of
diverse strategies, their application to the corpus, and the evaluation of the
results. All the entries inserted in the Sift database are processed in two main steps:
• the parsing
• the splitting
4.3.1 The parsing
This step transforms the linear structure of the sentence into a tree representation of its
grammatical elements. This is the starting point for the splitting into multiple
domains. The tool used in this phase is the Stanford Natural Language Parser,
a Java-based library that, given a corpus of data trained for the English language,
produces a parse tree in which each atom is annotated with its role and the
different structures corresponding to the parts of the sentence (object, verb,
complement) are grouped into the tree.
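For illustration only, such a parse tree can be represented as nested tuples; this is a hand-built, hypothetical stand-in for the Stanford parser's output (the tree, its labels and the helper are ours, not part of the thesis framework):

```python
# A hand-built stand-in for a Stanford-style parse tree: each node is
# (label, child, child, ...); a leaf is a (POS-tag, word) pair.
# Hypothetical tree for "I'd like a Thai restaurant in LA".
tree = ("ROOT",
        ("S",
         ("NP", ("PRP", "I")),
         ("VP", ("MD", "'d"),
                ("VP", ("VB", "like"),
                       ("NP", ("DT", "a"), ("JJ", "Thai"), ("NN", "restaurant")),
                       ("PP", ("IN", "in"), ("NP", ("NNP", "LA")))))))

def leaves(node):
    """Collect (tag, word) pairs left to right."""
    if len(node) == 2 and isinstance(node[1], str):
        return [node]
    out = []
    for child in node[1:]:
        out.extend(leaves(child))
    return out

# nouns and verbs are the "atoms" later passed to the domain lookup
keywords = [w for t, w in leaves(tree) if t.startswith(("NN", "VB"))]
print(keywords)  # -> ['like', 'restaurant', 'LA']
```

This same tuple encoding is convenient for experimenting with the splitting strategies described next.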
4.3.2 The splitting
This step divides the input entry into multiple parts where each part should cor-
respond to a single domain that will be searched. The precision of the structure
obtained in the first part becomes very important, as it is by exploiting that structure
and its properties that we find opportunities to split an entry into
different parts. Many techniques were considered, but two main ones were retained
and tested on the input:
• First Level Split: This split is done directly at the first level of the sentence,
on the assumption that the upper levels of the tree will be divided into different
sub-sentences, each corresponding to a single, unique domain.
• Clause Level Split: This split follows directly the parser’s recognition of
clauses, either subordinate or relative.
While splitting a sentence into different parts is an important step towards getting
the final domains out of an entry, it still returns a response that is too coarse.
We must thus reduce each part to its simplest objects, the ones that characterize
the domain in which we find ourselves. Therefore, from each part, we keep only the nouns,
and the verbs if they correspond to meaningful actions (e.g. drive, rent, cure).
The clauses recognized by the parser and filtered by the splitter are: plain
sentences, relative clauses, subordinate clauses, and interrogative clauses (questions).
Keeping only nouns and verbs during the splitting causes a loss of information
about the qualities expressed in the request, because adverbs and adjectives are
discarded. This choice was made because only nouns and verbs are annotated
with domains in the WordNet Domains database. The lack of quality-defining
expressions at this stage of the process has to be compensated for in the following
steps of the query analysis.
4.4 The Extraction of the Data Types
The third step in the query analysis process is the extraction of the different types of
elements in the query. The typical form that a user fills in when using a
web service usually requires data in many different formats: dates,
prices, names and titles are only a few of them.
To efficiently analyze a natural language query, the system has to recognize and
identify the greatest possible number of input parameters. If a parameter can be
labeled with its format, the probability of a good match between the query and
the service attributes will be higher. For this reason we decided to use a
Named Entity Recognizer (NER) to extract named entities from the queries. We considered
different NERs and finally chose the one implemented by the Stanford group,
because it is completely compatible with the parser libraries already in use.
This NER can recognize entities of three kinds: Persons, Locations and Organizations.
These “proper noun” words are recognized by means of a large training set that
functions as a great database of information. There are many other NERs that
perform in a similar way and can recognize more entity types, such as numbers
and dates; we chose this particular NER firstly because of its compatibility with
the project already developed, and secondly because we believe that a more efficient
recognition of “standardizable types” can be achieved with regular expressions.
This led us toward a simpler NER rather than a complex, multifunction one.
We call “standardizable types” all the data types that follow a standard pattern
in their expression. Prices, for instance, are always numbers followed or preceded
by the symbol or the name of a currency; titles, if written correctly, are delimited
by double quotes (“I’m a Title”); distances have the same characteristics as prices,
with a unit of measurement, and so on. We chose regular expressions because the
ability to change the expressions in our program freed us from being dependent on a NER.
NERs can be very useful for entity recognition on “natural” words, but the machine
learning algorithms and big training sets they rely on are not easy to handle and
may not be powerful enough for standardizable types.
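As an illustration of this approach, the patterns below sketch how a few standardizable types could be captured with regular expressions (the type names and the exact patterns are our assumptions for illustration, not the thesis implementation):

```python
import re

# Illustrative patterns for "standardizable types"; real patterns would
# need to cover many more currencies, units and quoting styles.
PATTERNS = {
    "Price":    re.compile(r"[$€£]\s?\d+(?:\.\d{2})?|\d+(?:\.\d{2})?\s?(?:dollars|euros|USD|EUR)"),
    "Distance": re.compile(r"\d+(?:\.\d+)?\s?(?:km|miles|meters|mi)\b"),
    "Title":    re.compile(r'“[^”]+”|"[^"]+"'),   # double-quoted spans
}

def extract_types(text):
    """Return every labeled match found in the query text."""
    found = {}
    for label, rx in PATTERNS.items():
        hits = [m.group(0) for m in rx.finditer(text)]
        if hits:
            found[label] = hits
    return found

query = 'A hotel under $100 within 5 km that screens "Casablanca"'
print(extract_types(query))
```

Because the patterns live in the program itself, swapping a currency symbol or a unit is a one-line change, which is exactly the independence from the NER argued above.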
4.5 Mapping to domains
From the basic objects identified in the clauses, nouns and verbs, we use another
set of tools and techniques to extract domains out of them. Subsequently, in the
SeCo application, domains can be mapped to a web service. In order to do this, we
focused on the tools provided by the WordNet project, and especially the add-on
of WordNet Domains. The approach used is to parse the dictionary of WordNet,
which is organized into words, each relating to one or more synonym sets, or senses
of a word, also called synsets. Each synset has a unique identifier consisting
of its offset within the WordNet database. We use this identifier to connect a
synset to its associated domains within the WordNet Domains database, where
the key is the synset offset and the values are one or more domains. Fig. 4.4 shows the
relationships we follow to get the domains.
Figure 4.4: The WordNet Domains Hierarchy
The domain retrieval process can result in a large number of domains from a single
word. The perfect approach (from the human point of view) would be
to identify, for every word, which sense it refers to in the given
sentence, and retrieve the domains accordingly. Given the inner difficulty of this
task, in order to get the most relevant domains we use the tf-idf [10] information
retrieval technique. This is a sorting mechanism that calculates the importance of
a single domain by its relative presence in a single word, over how common it is
across all the domains we retrieved from the objects of the sub-entry. A second
technique that was evaluated is to retrieve the domain relationship directly from
WordNet, which can give a word sense a relationship to another word of which
it is the topic. WordNet is organized as an index of words to their possible senses,
and a database containing details about such senses. In particular, information
about the relationships between the current sense and others is kept in that database.
There are many kinds of relationships, such as is-a, is-part-of or, as we wish to extract,
is-member-of-this-domain. These relationships allow one to go from sense,
or synset, to sense, forming a graph spanning the whole database. The
approach using WordNet topics (i.e. the is-member-of-this-domain relationship)
was discarded because of the scarcity of data in the database; in fact, such a
relationship was available only for very few words.
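One plausible reading of this tf-idf scoring can be sketched as follows; the word-to-domain mapping is invented for illustration, and the exact weighting used in the thesis may differ:

```python
import math
from collections import Counter

# Hypothetical word -> domain lists, one entry per WordNet synset of the
# word (so a domain repeats when several senses share it).
word_domains = {
    "restaurant": ["gastronomy", "gastronomy", "commerce"],
    "thai":       ["gastronomy", "linguistics", "geography"],
    "theater":    ["theatre", "architecture"],
}

def tfidf_scores(word_domains):
    """Score each domain: tf = relative presence among a word's senses,
    idf = how rare the domain is across the sub-entry's words."""
    n_words = len(word_domains)
    df = Counter()                        # in how many words does a domain occur?
    for doms in word_domains.values():
        df.update(set(doms))
    scores = Counter()
    for doms in word_domains.values():
        for d, c in Counter(doms).items():
            tf = c / len(doms)
            idf = math.log(n_words / df[d]) + 1.0   # +1 keeps shared domains visible
            scores[d] += tf * idf
    return scores.most_common()

print(tfidf_scores(word_domains)[0])      # highest-scoring domain first
```

With these toy numbers, "gastronomy" comes out on top because it is strongly present in two of the three words of the sub-entry.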
To retrieve as much information as possible from the queries, another domain
retrieval method was implemented, which exploits the Named Entity Recognizer.
The extraction of the domains from the WordNet Domains database involves only
nouns and verbs, since only those are retrieved by the parser. With the use of the
NER we can recognize some entities, as explained before, and then use their
category label to retrieve the domains. This allows us not to lose much information
due to the presence of proper nouns.
Example: “I want to go to Los Angeles by plane from Milan and find a hotel near
George Clooney’s house”
Entities recognized:
• Locations: Los Angeles/Milan
• Organizations: none
• Persons: George Clooney
Words from which the domains will be retrieved: want, go, Location, plane, Location, find, hotel, Person, house
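The substitution step shown by the example can be sketched as follows (the entities are hard-coded here for illustration instead of coming from the Stanford NER):

```python
# Replace recognized named entities with their category label before the
# WordNet Domains lookup; in the real pipeline these pairs come from the NER.
entities = {"Los Angeles": "Location", "Milan": "Location",
            "George Clooney": "Person"}

def substitute_entities(text, entities):
    """Rewrite each entity surface form as its category label."""
    for surface, label in entities.items():
        text = text.replace(surface, label)
    return text

q = ("I want to go to Los Angeles by plane from Milan and find a hotel "
     "near George Clooney's house")
print(substitute_entities(q, entities))
```

The rewritten sentence contains only dictionary words and category labels, so every token can now be looked up in WordNet Domains.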
4.5.1 Methods to improve the domain score
The tf-idf method of ordering domains for each sub-entry gives quite good results,
but we believe it can be improved with other methods that alter and
tweak the scores. We propose three methods:
• Most frequent couples: we calculate the frequency of the couples of domains
among the queries already examined and assign a bonus to the most
frequent couples. Using this criterion, “distant” or very different domains can
be assigned a higher score on the basis that they have been found together in
a query many times.
• Most frequent couples in Service Marts: the same approach can be applied
to the scoring of the Service Mart domains. An offline analysis can be
done and a bonus can be assigned to the most frequent couples of domains
among the Service Mart annotations.
• Nearest domains: a bonus is assigned to the domains in the query that are
close to each other according to the distance on the WordNet Domains tree.
Only the third method was actually implemented, for several reasons, first and
foremost the absence of a testable, reliable and sufficiently large database of
queries and service marts.
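A minimal sketch of this nearest-domains bonus, assuming a parent-pointer encoding of a small, invented slice of the domain hierarchy:

```python
# A tiny, invented slice of a domain hierarchy, as parent pointers
# (the actual WordNet Domains tree is much larger).
parent = {
    "gastronomy": "applied_science",
    "food": "applied_science",
    "applied_science": "root",
    "theatre": "art",
    "art": "root",
}

def path_to_root(d):
    path = [d]
    while d in parent:
        d = parent[d]
        path.append(d)
    return path

def tree_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors = set(pa)
    for steps_b, node in enumerate(pb):
        if node in ancestors:
            return pa.index(node) + steps_b
    return None   # disjoint trees

def apply_nearness_bonus(scores, max_dist=2, bonus=0.5):
    """Boost every pair of extracted domains that sit close in the tree."""
    boosted = dict(scores)
    doms = list(scores)
    for i, a in enumerate(doms):
        for b in doms[i + 1:]:
            d = tree_distance(a, b)
            if d is not None and d <= max_dist:
                boosted[a] += bonus
                boosted[b] += bonus
    return boosted

print(apply_nearness_bonus({"gastronomy": 1.0, "food": 0.8, "theatre": 0.9}))
```

Here "gastronomy" and "food" share a parent (distance 2), so both get the bonus, while "theatre" is too far away and keeps its score; the threshold and bonus values are illustrative.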
Further research along this line could involve more complex data mining
methods and approaches. The biggest problem in this domain scoring approach
is the small number of domains: even if we could achieve a perfectly
ordered list of domains for each sub-entry, its meaning would be very poor with
respect to the possible annotations used on the service mart side. Another possible
approach to this problem would be to somehow use the retrieved domains to find
matching synsets that could be useful in the subsequent matching process.
4.6 The Service Mart Repository
The SeCo project is still a work in progress and the registration of service
marts is not active yet. Therefore, to efficiently test our query analysis and matching
processes, we decided to create a list of fictitious service marts with characteristics
and parameters very close to the real ones, using as model and inspiration
the ones presented in the YQL database. The semantic value of these service marts spans
a great number of domains, and they are complete with data type descriptions
and multiple access patterns. We thus populated a repository with approximately
70 service marts that we used in our experiments.
4.7 Map sub-queries to Service Marts
To integrate the Sift application with the main SeCo project, a “Query to Service
Mart” mapping is needed. This mapping is based on the domains associated with
each of the keywords found in the query and on the corresponding semantic
definition of the Service Mart. From the available list of service marts we extract, for
each sub-query, a list of suitable ones according to their semantic annotation. The
scoring system used in the matching to order the retrieved service marts is based
on the individual score each matched domain received in the previous
calculations.
This is a quite simple approach to the matching problem, but nonetheless quite
effective, thanks to the good scoring system for the domains used previously.
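The mart-ranking step can be sketched as follows (mart names, annotations and scores are invented for illustration):

```python
# Score each service mart by summing the scores of the query domains that
# appear in its semantic annotation; annotations are made-up examples.
marts = {
    "RestaurantFinder": {"gastronomy", "tourism"},
    "TheaterShows":     {"theatre", "art"},
    "HotelBooking":     {"tourism", "commerce"},
}

def rank_marts(domain_scores, marts):
    """Return (mart, score) pairs, best match first; unmatched marts drop out."""
    ranked = []
    for name, annotation in marts.items():
        score = sum(s for d, s in domain_scores.items() if d in annotation)
        if score > 0:
            ranked.append((name, score))
    return sorted(ranked, key=lambda p: p[1], reverse=True)

sub_query_domains = {"gastronomy": 1.4, "commerce": 0.7}
print(rank_marts(sub_query_domains, marts))
```

Because the mart score is just the sum of the matched domain scores, the quality of the ranking inherits directly from the quality of the domain scoring discussed earlier.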
A further development of this approach would be to also consider other semantic
annotations that could be used for service marts, such as synsets. To support this
kind of annotation we would have to extract domains from those synsets and then
choose the right service marts according to their new semantic description.
4.8 Map sub-queries to Access Patterns
Once all the sub-queries are matched with a list of potentially compatible service
marts, we have to specifically match the data we previously retrieved to the attributes
in each Access Pattern. Access Patterns are a sub-entity of the service
marts and contain all the service definitions that ultimately have to be connected
to the parameters the user provides in input. The ultimate goal of this task is
to match each of the identified sub-queries to one or more services, handled by
the Access Patterns, so that the subsequent processing of the query can take place
according to a well-defined query execution strategy. Every single data entity required
by the services has to have a counterpart in the user request so that the
service can be invoked and give back a result.
Figure 4.5: Example of sub-query/AP matching
Every Access Pattern is composed of a number of service attributes that define
the searching capabilities of the mart. These service attributes are annotated
semantically with domains and synsets. We also hypothesize that the service provider
will indicate, for every attribute, a data type chosen from the enumeration we defined
in the UML diagram of Fig. 4.3. With these annotations defined, we can then match
every parameter, starting from the mandatory ones, to the available access patterns.
The matching is done respecting the order of the data both in the Extracted
Data structure and in the Access Pattern one. We assume that this is an advantage
for temporal and spatial parameters, which are usually placed in a certain order:
Start->Destination, DateOfDeparture->DateOfReturn.
The matching process is not a trivial one, since not all the parameters can be
unambiguously matched. For instance, an imprecision in the analysis can miss the name
of a location or an organization; when these parameters are then requested
by the Service Attributes, no match will be found. For this reason we decided
not to develop a single static matching but a dynamic one, which checks more than
one data type for each request, according to the nature of the data requested. For
instance, if no word labeled with the type Title is available in the sub-query
examined, then a simple expression labeled “Word” will match. The same was
done with all the numeric types, such as dates, prices and times.
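This dynamic fallback matching can be sketched as follows (the fallback chains, type names and attribute names are assumptions for illustration):

```python
# Fallback chains: if no item of the exact type is left in the sub-query,
# try progressively more generic ones (chains are illustrative).
FALLBACKS = {"Title": ["Title", "Word"], "Location": ["Location", "Word"],
             "Date": ["Date", "Number"], "Price": ["Price", "Number"]}

def match_attributes(attributes, extracted):
    """Greedy in-order matching of access-pattern attribute types to
    extracted (type, value) items; each item is consumed at most once,
    and order preservation covers cases like Start->Destination."""
    pool = list(extracted)
    result = {}
    for attr_name, attr_type in attributes:
        for wanted in FALLBACKS.get(attr_type, [attr_type]):
            hit = next((item for item in pool if item[0] == wanted), None)
            if hit:
                pool.remove(hit)
                result[attr_name] = hit[1]
                break
    return result

attrs = [("departure", "Location"), ("destination", "Location"), ("film", "Title")]
data = [("Location", "Milan"), ("Location", "Los Angeles"), ("Word", "horror")]
print(match_attributes(attrs, data))
```

Here the two Locations are assigned in order (departure before destination), and the Title attribute, finding no quoted title, falls back to the plain word "horror".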
4.8.1 The semantic name matching
A different analysis was done on named entities. These are a special kind of entity
because, unlike the others, they have a semantic annotation that we
retrieved from WordNet and WordNet Domains. These annotations can be useful
when more than one name parameter is needed. Through the calculation of a
matching score very similar to the one used for the sub-query/service
mart matching, we sort the eligible names by highest compatibility and choose the
most suitable one.
4.8.2 Evaluation Criteria and Statistics
Some improvements to the original Sift application were required to efficiently
filter the queries among the raw set we acquired from our source, Yahoo! Answers.
The Yahoo! Answers input structure requires users to type a “title” and a
proper question in the form; the Sift application only acquires the “title” of the
question, since the complete text of the question is often too long and filled with
other objects, such as links, that are not useful for our analysis.
Due to this choice, a lot of filtering has to be done to eliminate incomplete queries or
inconclusive “titles”.
The preprocessing
During the preprocessing of the queries a basic filtering is applied manually, and
entries are deleted if no question is asked or if there are grammatical or spelling
errors in the keywords of the phrase. As an improvement to this phase, a correction
form has been added to the application for every entry retrieved; it can be used to
correct and update sentences with typos, spelling errors and abbreviations without
having to eliminate them. The option to eliminate one or more entries altogether
has also been added.
The query evaluation criteria
Since all the queries we acquire have to be evaluated manually (by a human being)
to separate the multi-domain ones, optimal for our purposes, from the single-domain
or ill-defined ones, a general evaluation criterion is required.
The evaluation, originally based on a 5-star rating, has been changed to a 3-star
rating, and a score from 1 to 3 must be assigned to every query during the process.
We define here an evaluation criterion to standardize this phase and make it possible
for everyone who is evaluating the queries to do it within some guidelines.
• 1 Star - Single domain query (Not useful for our purposes)
• 2 Stars - Ambiguous multi-domain query (a multi-domain query not suitable
for our research, e.g. “What is the best birthday gift for my wife?”)
• 3 Stars - Multi-domain query
Statistics
Once the entries are processed, a screen split into six sections is presented to the user.
These sections illustrate, respectively, the results of: the first-level split, the clause split,
the domain extraction based on the clause split and its optimized version,
and the service mart matching retrieved. These aspects of the processing can be
evaluated manually with a 3-star rating. To give a more user-friendly presentation of
the results, some statistics graphs have been added to the Sift application.
First-level split
The first-level split is the task that takes the considered entry and splits it based on
the first-level division found in the Stanford parse tree. It finds the first internal
node that has more than one child, and takes each child as a different section.
Then, in each section subtree, we look for the interesting elements, noun and verb
atoms, and take them as the objects.
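The first-level split described above can be sketched, under our own simplified tuple encoding of a parse tree (an illustrative stand-in, not the thesis code), as:

```python
# Toy tree: each node is (label, child, ...); leaves are (POS-tag, word).
tree = ("ROOT",
        ("S",
         ("S", ("NP", ("NN", "restaurant")), ("VP", ("VB", "eat"))),
         (",", ","),
         ("S", ("NP", ("NN", "hotel")), ("VP", ("VB", "sleep")))))

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def first_level_split(node):
    """Descend through single-child nodes until one with more than one
    child is found; each of its children becomes a section."""
    while not is_leaf(node) and len(node) == 2:
        node = node[1]
    return list(node[1:]) if not is_leaf(node) else [node]

def leaves(node):
    if is_leaf(node):
        return [node]
    out = []
    for c in node[1:]:
        out.extend(leaves(c))
    return out

sections = first_level_split(tree)
# keep only noun and verb atoms in each section
keywords = [[w for t, w in leaves(s) if t.startswith(("NN", "VB"))]
            for s in sections]
print(keywords)
```

Note how the comma forms an empty section once the noun/verb filter is applied; this kind of noise is part of why the method scored lower than the clause split in the evaluation below.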
The evaluation criteria are defined as:
• 1 Star - Completely inadequate splitting considering the domains of the
entry
• 2 Stars - Lack of precision in the splitting or in the extraction of the keywords
• 3 Stars - Precise division of domains and keywords
The statistics count the star ratings, so that it is possible to evaluate the split method
over a great number of entries. A statistic on the number of keywords extracted
has also been implemented.
Figure 4.6: Example statistics for the first-level split
Clause split
In this splitting method the tree is visited in a depth-first, left-to-right manner,
buffering elements in a domain until a new clause is encountered. From each
buffered part we then take all the leaves, filter out the ones that are neither
nouns nor verbs, and return the rest as the resulting objects.
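This depth-first, buffer-until-clause visit can be sketched as follows (an illustrative simplification: only SBAR/SBARQ nodes open a new buffer here, and the tuple tree encoding is our own, not the thesis code):

```python
CLAUSE_LABELS = {"SBAR", "SBARQ"}   # assumed clause labels for the sketch

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def clause_split(node, buffers):
    """DFS left to right; opening a clause node starts a new buffer,
    and every leaf lands in the most recent buffer."""
    if is_leaf(node):
        buffers[-1].append(node)
        return
    if node[0] in CLAUSE_LABELS:
        buffers.append([])
    for child in node[1:]:
        clause_split(child, buffers)

# Toy tree for "restaurant is that serves pasta" (relative clause example)
tree = ("S",
        ("NP", ("NN", "restaurant")),
        ("VP", ("VB", "is"),
               ("SBAR", ("WHNP", ("WDT", "that")),
                        ("S", ("VP", ("VBZ", "serves"),
                                     ("NP", ("NN", "pasta")))))))

buffers = [[]]
clause_split(tree, buffers)
# keep only the noun/verb leaves of each buffer
keywords = [[w for t, w in buf if t.startswith(("NN", "VB"))] for buf in buffers]
print(keywords)
```

Each buffer corresponds to one candidate sub-query, and the noun/verb filter yields exactly the objects later sent to the domain extraction.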
The evaluation criteria are defined as:
• 1 Star - Completely inadequate splitting considering the domains of the
entry
• 2 Stars - Lack of precision in the splitting or in the extraction of the keywords
• 3 Stars - Precise division of domains and keywords
As we did for the first-level split, a statistic on the star ratings is implemented to
evaluate the method, together with a statistic on the number of clauses split from each
sub-entry, which will be useful for the analysis of the results.
Figure 4.7: Example statistics for the clause split
Domains Extraction
After the evaluation of a number of entries we could observe that the splitting
done by the first-level split had lower ratings than the one done by the clause split,
so the latter was chosen to extract the domains.
The evaluation criteria are defined as:
• 1 Star - Completely inadequate domain extraction
• 2 Stars - Presence of inadequate and off-topic domains with a high score
• 3 Stars - Precise extraction of the domains
In this section the most useful statistic is the one on the number of domains extracted.
The main problem in this extraction is the vast number of domains extracted
from a single entry with a relatively small number of keywords. This is
due to the presence of multiple synsets corresponding to each single word in the
WordNet database.
Figure 4.8: Example statistics for domains extraction
Chapter 5
Implementation
5.1 The system general architecture
Figure 5.1: The Architecture schema
The framework that powers this research environment is centered on two top-level
tasks: the creation of the corpus and the use of this corpus within the context of
query analysis. The tools created to support and enhance these tasks are flexible
and can accommodate the diverging needs that are centered around the single
set of data. The center of it all is the web front-end, which powers the creation of
the corpus of queries and also functions as a visualization tool both for the outputs
of the algorithms employed in the underlying application to analyze the queries and
extract the domains, and for the statistical results section. The front-end
communicates with the outside by retrieving questions from the Yahoo!
Answers web service. This feature is on request, so a user browsing the Sift
web page can ask to retrieve questions from the outside; these questions will
be shown as new unrated entries and saved in the database.
The load of the process required to analyze the queries is non-negligible, both
in terms of CPU usage and memory, so it is not advisable to perform the
extraction and the analysis in real time in the same environment as
the database and the Web front-end. Therefore, a mechanism to offload the work
onto another computer has been devised. It is based on a standard and simple
architecture for background workers: the Web front-end, or the user through the
command line, can post work items on a queuing server, where they will be picked
up by the first available client. The client then processes the task, given the input
parameters to elaborate, and stores the results back in the database, from which
the web front-end will retrieve them later upon a user request.
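The work-offloading pattern can be sketched in-process as follows (the real system uses a Kestrel queue, a CouchDB database and separate machines; this single-process analogue only illustrates the producer/worker flow):

```python
import queue
import threading

# In-process stand-ins for the queuing server and the result store.
work_queue = queue.Queue()
results = {}

def analyze(entry_text):
    # placeholder for the parse/split/match pipeline
    return entry_text.split()

def worker():
    """Background worker: pull tasks until a sentinel arrives."""
    while True:
        entry_id, text = work_queue.get()
        if entry_id is None:               # sentinel: shut down
            break
        results[entry_id] = analyze(text)  # "store back in the database"
        work_queue.task_done()

t = threading.Thread(target=worker)
t.start()
# the front-end posts work items and returns to the user immediately
work_queue.put(("e1", "thai restaurant in LA"))
work_queue.put((None, None))
t.join()
print(results["e1"])
```

The front-end never blocks on the analysis: it only enqueues the entry and later reads whatever results the worker has stored.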
5.2 The Sift Application
Sift is the Web Application composed of the front-end and the tool used to extract
data from the Yahoo! Answers Web Service. It is based on the Sinatra framework
and has thus been written in the Ruby programming language. In addition to
the main Sinatra library for web application development, it imports the Ruby
libraries used to interface with the CouchDB server, the Kestrel queue service and
the Yahoo! Answers Web Service. The application is divided into three
principal parts: the Models, the Controllers and the Views.
Figure 5.2: The models
There are two models in Sift, as shown in Fig. 5.2. The first one refers to
the entries stored in the document database. These entries contain the
input retrieved from Yahoo! Answers, the rating given to the
entry itself, the results of the elaboration on the entry and their evaluations. The
creation or update of an entry makes the tool automatically send a message to the
queue server, where the task can later be picked up by the background worker
application that will analyze it. The second model corresponds to the result of a
query made to the Yahoo! Answers website, before it is inserted in the document
database, where it becomes an Entry. It contains the fields of the content provided
by the Yahoo! API; the ones of interest to us are the category identifier, the question
identifier, the question title and the body.
The controllers in Sift correspond to a series of functions that are called when an
incoming URL matches a route pattern. The most important method is
the one for the index page, where most of the work is done. Its basic task is to
prepare the list of entries to be shown to the user. In order to do this, it
takes a list of parameters fed by the user, which can include the number of items to
show, the page to show, the ordering, and a filtering by the rating given to an entry or
by the lack of it. The interface allows for the manual creation of an entry, for which
a method is thus available. Methods to modify and delete one or multiple
entries have also been implemented, so that the filtering and elaboration of the input
is easier. Other available methods let the user rate an entry, either
as a whole, in the case of the extraction of the corpus, or for a specific feature,
which is useful for the evaluation of the strategies.
Figure 5.3: Screen of the Sift application
The last part is the view module, where a template is processed, taking as input
the different variables prepared by the controller. On the index page, where the
list of entries is shown, the output HTML contains the list of all the entries, with
each entry containing the results of any processing done, although
this part is hidden at first. The index page shows a list of summaries that
can be navigated either with the mouse or the keyboard. Elements can
be given a rating by clicking the corresponding star to the right of them. It is also
possible to change the rating of more than one element at once by selecting them
first and then using the drop-down action menu or the keyboard shortcuts to give
them a new rating. To see the complete details of a single entry one has to click on
it, and the screen will show all the retrieved data. If the data has been previously
processed by the background worker system, it is possible to see
the alternative strategies and rate them individually; otherwise the phrase "This
entry hasn't been parsed yet" will appear on the screen. This interface was created
using the jQuery toolkit for JavaScript, which offers a high-level view of the web
page, allowing us to query and manipulate elements as well as make asynchronous
calls to the servers.
5.2.1 Bee - Distributed Background Processing
All the procedures and analysis functions introduced in the previous
chapters find their place in the framework described in this section, as modules of the
background worker.
In this framework, used to test query analysis and domain extraction strategies,
we find the greater part of the tool chain created
expressly for this research. We decided to create such a system in order to
allow asynchronous processing, as well as to offload the main
server. The system, nicknamed Bee, was written in Scala and uses libraries to convert
to and from JSON and to access CouchDB as well as the Kestrel queue service.
Refer to fig. 5.4 for an overview of the classes that compose this framework.
Figure 5.4: The Bee Structure
Tasks
The central concept in Bee is that of a task. A task is a single unit of work that
is coded by the user of the library. Tasks are centered around one function, run,
which takes as parameters the original input fed from the queue, as well as the
results of the previous tasks, and returns the result of its computation in JSON
format. Tasks are organized in chains, where each chain is given a unique
identifier. It is the ultimate result of the chain that is stored back
into the document database.
Method                  Description
setup(configuration)    Optionally implemented; provides a means for the task to
                        access its configuration and use it to set itself up, for
                        example by loading data dictionaries. This is done only
                        once, when the task is first loaded, so it is a good
                        opportunity to cache things.
identifier: string      Returns the unique identifier of the task within the chain.
run(inputParams): json  The core method of the task. It takes as input a Map
                        from the queue as well as the results of the previous tasks
                        in the chain, arranged according to their identifiers. The
                        output of the function should be a JSON value. Any
                        exception thrown by this task will interrupt the chain and
                        will be stored in the errors field in the database.
version: string         Allows a task to report its version, forcing the
                        re-computation of all the elements in a chain starting from
                        this task.
Table 5.1: The task interface
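As an illustrative sketch of this interface (in Python rather than the Scala used by Bee, with hypothetical names mirroring Table 5.1), a task could look like:

```python
import json


class Task:
    """Illustrative sketch of the Bee task interface (Table 5.1)."""

    def setup(self, configuration):
        # Optional: run once when the task is first loaded,
        # e.g. to cache data dictionaries.
        pass

    def identifier(self):
        # Unique identifier of the task within its chain.
        raise NotImplementedError

    def version(self):
        # Bumping this forces re-computation from this task onward.
        return "1"

    def run(self, input_params):
        # input_params carries the original queue input plus the
        # previous results, keyed by task identifier; returns JSON.
        raise NotImplementedError


class WordCountTask(Task):
    """Hypothetical example task: count the words of the input sentence."""

    def identifier(self):
        return "word-count"

    def run(self, input_params):
        sentence = input_params["input"]
        return json.dumps({"words": len(sentence.split())})
```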
Workers
The workers interact with the queue service and oversee the execution of the
tasks in their chain. At initialization they are given a unique identifier, the
name of the chain as defined in the configuration. They then set up every task
before opening a connection to the queue service, waiting for messages in the queue
named after their identifier. Once a message is received, they first check the
database to see whether this particular instance of the chain has already been executed.
If that is the case, the existing data is fetched and parsed, and tasks are skipped
as long as no errors have been encountered and no task version has changed.
Once all the remaining tasks have been executed, the worker asks the database
actor to store the updated document before resuming listening on the queue.
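The control flow just described can be sketched as follows; this is a simplified, single-threaded Python rendition, with hypothetical stand-ins for the database actor and the tasks:

```python
class InMemoryDb:
    """Hypothetical stand-in for the CouchDB document store."""
    def __init__(self):
        self.docs = {}
    def get(self, key):
        return self.docs.get(key)
    def store(self, key, doc):
        self.docs[key] = doc


def process_message(message_id, tasks, db, queue_input):
    # Fetch any existing chain state for this message.
    doc = db.get(message_id) or {"results": {}, "versions": {}, "errors": {}}
    for task in tasks:
        tid = task.identifier()
        # Skip tasks already executed with an unchanged version and no error.
        if (tid in doc["results"]
                and doc["versions"].get(tid) == task.version()
                and tid not in doc["errors"]):
            continue
        try:
            doc["results"][tid] = task.run({"input": queue_input, **doc["results"]})
            doc["versions"][tid] = task.version()
            doc["errors"].pop(tid, None)
        except Exception as exc:
            # An exception interrupts the chain and is stored in the errors field.
            doc["errors"][tid] = str(exc)
            break
    # Store the updated document before resuming listening on the queue.
    db.store(message_id, doc)
    return doc
```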
Splitting Tasks
Splitting tasks are a specialization of the generic task in which parsing and
serialization are already taken care of. The user only has to define a function, split,
that transforms the input from a tree-based representation into a series of parts.
Each part is an instance of a class composed of two fields: the first is the
phrase, or sub-sentence, that is considered as that part; the second is the
list of the objects retrieved from that phrase, where each object is a pair made
of the object itself and its part of speech (e.g. verb, noun or adjective).
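A minimal sketch of the part structure and a splitting task wrapping a user-supplied split function (Python here; the framework itself is Scala, and the names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Part:
    """A sub-sentence and the (word, part-of-speech) objects found in it."""
    phrase: str
    objects: List[Tuple[str, str]] = field(default_factory=list)


class SplittingTask:
    """Wraps a user-supplied split function; parsing and serialization
    would be handled by the framework around it."""

    def __init__(self, split: Callable[[object], List[Part]]):
        self.split = split

    def run(self, tree) -> List[Part]:
        # Transform the tree-based input into a series of parts.
        return self.split(tree)
```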
Domain Extraction and matching task
These tasks have been united in one single big task because of implementation and
resource requirements. They correspond to the third phase of the query analysis
process, where the different parts of a sentence are analyzed in order to obtain a
series of domains that are later mapped to different query services. This interface
once again has a single method to implement. This method, named extract,
takes the list of parts obtained from the previous operation and has to return, for
each part of the sentence, a list of possible domains and a list of matched Service
Marts. If nothing of importance is found, or if a word is not recognized, the output
list can be empty.
5.3 Procedures
5.3.1 Parsing
Once the framework was established and stabilized, the first tasks were imple-
mented. The first of these were the parsing strategies, which take as input the
natural language sentence and output its grammatical structure. This output has
the form of an arbitrarily deep tree, where a leaf represents an atom, a word
of the sentence, while an internal node represents a grouping of these words in
some structure, for example a noun phrase or a verb phrase.
Parsers evaluated Different parsers were evaluated to test their performance
with respect to our processing needs. The first parser to be evaluated was the Stanford
Natural Language Parser. Distributed as a Java library, it requires little code to use:
one simply loads the parser with the chosen training data file and then
applies it to a sentence to get the resulting tree. This tree is then transformed
from the native tree representation of the Stanford Parser into
a generic one that is used by the later tasks. Here is an example of how to
load and apply the parser, in Scala.
// dataFile points to the training data set on the hard drive
// input contains the natural language sentence
import edu.stanford.nlp.parsers._

val parser = new LexicalizedParser(dataFile)
val tree = parser.apply(input)
Another parser tested was the Shallow Parser, developed at the University of
Illinois at Urbana-Champaign. It takes a different approach to obtain the final
result: it works by using a series of different tools that process the input
into a progressively more complex form. The first of these steps is to sanitize the
input, making sure that every element is well tokenized; that is, every element in
the sentence is spaced out, even the punctuation. It also performs some slight
transformations and normalization operations. The output of this first operation
is then sent to a second program, which takes care of tagging each element of
the sentence with its most probable part of speech, be it noun, verb, adjective or
other. This is finally sent to a server called the chunking server, which takes
the annotated input and groups, or chunks, elements into what it considers the
basic structures of the sentence. The name shallow thus comes from the fact
that this grouping operation is done at one level only, which means that the output
can be formally defined as a sequence of elements, each of which is either an atom
or a sub-sequence of such atoms. This output, given as text by the server, is then
parsed by the task and put into a tree representation, although it is only
one level deep.
5.3.2 Sentence Splitting Strategies
Once we have a tree with a satisfying parsing structure (in our case we chose the
Stanford tree version), we proceed to the division of that structure into parts,
with the expectation that each part will correspond to a single semantic domain.
In output, each part of the sentence is represented by an instance of the "part"
class, which contains the extracted sub-sentence as well as the objects that are
considered important to the definition of the domain.
First-Level Split
A first strategy to split the sentences is to assume that the first level at which a
separation of the sentence occurs defines the various domains. The purpose of the
task is thus to find the first internal node that has more than one child, and to take
each child as a different part from which a domain will be extracted. From each
sub-tree we look for interesting elements, namely noun and verb atoms, and take them
as objects. While this first attempt at splitting the tree is simple and does not take
into account the subtleties of the resulting parse tree, it gives a good baseline and
provides a jumping-off point from which we can explore better techniques.
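A sketch of this first-level strategy, assuming a simplified tree representation (nested lists for internal nodes, (word, pos) tuples for leaves; Python is used here for illustration):

```python
def first_level_split(tree):
    # Descend until we find the first node with more than one child;
    # each of its children becomes a candidate part.
    node = tree
    while isinstance(node, list) and len(node) == 1:
        node = node[0]
    if not isinstance(node, list):
        return [node]          # degenerate case: a single atom
    return node


def collect_objects(subtree, keep=("noun", "verb")):
    # Gather the noun and verb atoms of a sub-tree as objects.
    if isinstance(subtree, tuple):
        return [subtree] if subtree[1] in keep else []
    out = []
    for child in subtree:
        out.extend(collect_objects(child, keep))
    return out
```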
Clauses Extraction
Given that sentences are most of the time organized in subject-verb-object
form, and that the object is often the element most likely to carry a
subjunctive or relative clause, we can expect the tree to lean to the right most of
the time, a fact the previous technique does not take into account. To fix that,
a second technique has been implemented, where the tree is visited
in a depth-first, left-to-right manner, buffering elements into a domain until a new
clause is encountered. Such a clause is encountered when we find an internal node
labeled as one of: plain sentences, relative clauses, subordinate clauses, or relative
clauses in the form of questions. From each buffered part we then take all the leaves,
filter out the ones that are neither nouns nor verbs, and emit the remainder as the
resulting objects.
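The visit can be sketched as follows, assuming internal nodes of the form (label, children), leaves of the form (word, pos), and a hypothetical set of clause labels:

```python
# Hypothetical subset of clause labels marking the start of a new part.
CLAUSE_LABELS = {"S", "SBAR", "SBARQ", "SQ"}


def clause_split(tree):
    parts, buffer = [], []

    def visit(node):
        label, payload = node
        if isinstance(payload, str):      # leaf: (word, pos)
            buffer.append(node)
            return
        if label in CLAUSE_LABELS and buffer:
            parts.append(list(buffer))    # a new clause starts a new part
            buffer.clear()
        for child in payload:
            visit(child)

    visit(tree)
    if buffer:
        parts.append(list(buffer))
    # Keep only the noun and verb leaves as the objects of each part.
    return [[leaf for leaf in part if leaf[1] in ("noun", "verb")]
            for part in parts]
```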
The Information Extraction and Matching algorithm In the following para-
graphs we go into the implementation details of the core activity of the thesis
project: the information extraction and matching algorithm.
The figure below shows the complete structure of the program, from the
extraction of the Part object (a sub-entry) to the matching of each of them to a
proper service mart.
Figure 5.5: The algorithm schema after the splitting
5.3.3 Information extraction
Part I
In this stage of the implementation we worked on the sub-entries to extract the
greatest amount of information out of them.
This was done through a 4-step process, the structure of which is illustrated in the
schema:
Figure 5.6: The Information Extraction Flow
Named Entity Extraction The named entity extraction tool is used to examine
each object extracted from the sub-entry. To do this we first initialize the classifier
with the training set; then, using as input the string value to examine, we extract
a list structured as List<Triple<String,Integer,Integer>>. This structure contains
the names of the entities extracted, either Location, Organization or Person, and
the offsets of the values they refer to in the examined string.
// initialize the classifier training set
var serializedClassifier = "classifiers/ner-eng-ie.crf-3-all2008-distsim.ser.gz"
// initialize the classifier
var classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
// extract the information
var g = classifier.testStringAndGetCharacterOffsets(sentence)
From the variable g we then get the types of the entities extracted and label the
examined values accordingly. These entity labels will then be used to extract the
domains from WordNet, a task that would otherwise be impossible for proper nouns.
Domain Extraction All the sub-entries output by the splitting methods are
stored in specific objects called "parts". These objects contain the original version
of the sentence in the sub-entry, all the noun and verb objects retrieved from WordNet,
as well as the nouns retrieved by the Named Entity Recognizer.
The schema below shows the structure of the retrieval of the domains from
each object saved in the part structure (O1, O2, O3). This retrieval is done by
exploring the WordNet Domains database; for each one of the objects there
may be more than one group of domains to retrieve: groups D1, D2 and D3 all
refer to O1, due to the subdivision into multiple synsets (fig. 4.4).
Figure 5.7: Domain extraction algorithm structure
Tf-Idf After we extracted the domains from the basic objects, in order to get
the most relevant domains we use the tf-idf information retrieval technique as a
sorting mechanism, calculating the importance of a single domain from its relative
presence in a single word over how common it is across all the domains we
retrieved from the objects of the part. This technique allows us to calculate how
much information a word carries within a set of documents. It has two principal
components. The Inverse Document Frequency, or idf, takes into account that a
word that is very common within the whole collection does not give much infor-
mation, as it lacks uniqueness; a word is therefore penalized if it appears
in too many documents. On the other hand, a word is considered as important to
a document as the number of times it appears in it, which the tf, or Term Frequency,
indicator measures. These two results are combined to give a final score to
each unique word within the part of the sentence, which we then use to order the
results, starting from the highest score.
In particular the formula used in our case was:
tf for domain i = ni / nTot
    ni   = the number of times domain i is encountered
    nTot = the total number of domains in the sub-query

idf for domain i = log( D / a(i) )
    D    = the total number of senses for the considered part
    a(i) = the total number of senses in which domain i appears
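Applying these formulas, the ranking step could be sketched as follows (a Python illustration; `senses` stands for the groups of domains retrieved per synset of a part):

```python
import math


def rank_domains(senses):
    # senses: list of sense groups, each a list of domain names
    # retrieved from WordNet for that sense.
    all_domains = [d for group in senses for d in group]
    n_tot = len(all_domains)   # total number of domains in the sub-query
    big_d = len(senses)        # total number of senses for the part
    scores = {}
    for domain in set(all_domains):
        ni = all_domains.count(domain)
        tf = ni / n_tot
        a_i = sum(1 for group in senses if domain in group)
        idf = math.log(big_d / a_i)
        scores[domain] = tf * idf
    # Order the results, starting from the highest score.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```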
Bonus Score Refinement After obtaining an ordered list of domains for each
part, we looked for a method to improve and refine the scores. We decided to
implement a feature that assigns a bonus score to each domain that has a connection
to the other domains of the part. This connection is based on the relation between
the domains in the WordNet domain hierarchy. We reproduced the model of the
hierarchy in our program, and with a simple method we can determine whether two
domains are related, as father-son or siblings, or unrelated.
Figure 5.8: A graphical sample of a substructure of the WordNet hierarchy
We decided to reward with a bonus score the nearby domains (siblings) in each
group retrieved. This was done under the assumption that domains that are related
in the hierarchy are more likely to be related in a sentence as well. Consider, for
example, the sub-query shown in Table 5.2: the domains retrieved from it appear on
the left and the optimized version on the right. The domains whose score has been
increased are: geography, tourism, transport, politics and geology. The sibling
groupings are, respectively, geography/geology (as seen in fig. 5.8) and
tourism/politics/transport.
"to travel to North Korea from America?"

Before refinement        After refinement
3.2331  geography        3.8797  geography
1.5222  tourism          1.8267  tourism
0.7611  history          0.7611  history
0.5074  transport        0.6089  transport
0.3805  politics         0.4566  politics
0.3805  geology          0.4566  geology

Table 5.2: Bonus Refinement example
We can see that the more relevant domains get an increase in value, though the
ordering of the domains does not change in this case.
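The refinement can be sketched as follows; the hierarchy excerpt and the 20% relative bonus are assumptions made for illustration (the factor is consistent with the values in Table 5.2):

```python
# Hypothetical child -> parent excerpt of the WordNet domain hierarchy.
PARENT = {
    "geography": "pure_science",
    "geology": "pure_science",
    "tourism": "social_science",
    "transport": "social_science",
    "politics": "social_science",
    "history": "humanities",
}


def related(a, b):
    # Father-son or siblings in the hierarchy.
    return (PARENT.get(a) == b or PARENT.get(b) == a
            or (PARENT.get(a) is not None and PARENT.get(a) == PARENT.get(b)))


def apply_bonus(scores, bonus=0.2):
    # Increase by a relative `bonus` the score of every domain that is
    # related to at least one other domain retrieved for the same part.
    out = {}
    for d, s in scores.items():
        if any(related(d, other) for other in scores if other != d):
            out[d] = s * (1 + bonus)
        else:
            out[d] = s
    return out
```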
Part II
Extraction of data types In addition to the entity extraction previously done with
the NER, we also implemented extraction of various data types using regular expres-
sions.
We decided to use a single expression per data type, capturing the
main characteristic of the type in question. This choice was made so that the only
change needed to update the extraction algorithm for a data type is the
modification of its regular expression.
Here is an overview of the regular expressions used.
Titles: All the expressions included in double quotes

"(\"[^\"]+\")+"

Dates: All the dates in the American format mm/dd/yyyy

"(0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])[-/.](20[0-9][0-9]|19[0-9][0-9])"
Prices: All the numbers followed by the name of a currency (USD, Pound, Euro...)

"[\s]([0-9]+(.[0-9]{2})?)([\s]?)(USD|usd|dollars|dollar|Pounds|pounds|Euro|euros|Euros)"

Numbers: Any real number, with optional decimal point and digits after the
decimal, and optional positive (+) or negative (-) designation.

"[\s][-+]?\d+(\.\d+)?[\s]"

Time: Times separated by either : or . It will match a 24-hour time, or a 12-hour
time with AM or PM specified. Allows 0-59 minutes, and 0-59 seconds. Seconds
are not required.

"[\s](((([0]?[1-9]|1[0-2])):([0-5][0-9])(:[0-5][0-9])?([\s])?(AM|am|aM|Am|PM|pm|pM|Pm))|(([0]?[0-9]|1[0-9]|2[0-3]):([0-5][0-9])(:[0-5][0-9])?))"
All the data retrieved are stored in a DataType object which is unique for every
sub-entry and contains all the information about it.
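As an illustration, the extraction step can be sketched in Python with slightly simplified versions of some of the patterns above (the simplifications and the function name are ours):

```python
import re

# Simplified Python renditions of the expressions listed above.
PATTERNS = {
    "title": r'("[^"]+")',
    "date": r'(0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])[-/.]'
            r'(20[0-9][0-9]|19[0-9][0-9])',
    "price": r'([0-9]+(\.[0-9]{2})?)\s?'
             r'(USD|usd|dollars|dollar|Pounds|pounds|Euro|euros|Euros)',
}


def extract_types(sentence):
    # Return, for every data type, the list of raw matches found.
    found = {}
    for name, pattern in PATTERNS.items():
        matches = [m.group(0) for m in re.finditer(pattern, sentence)]
        if matches:
            found[name] = matches
    return found
```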
5.3.4 Service Mart Semi-Automatic Generation
The service mart generation we implemented is structured in different phases:
Creation of “artificial” services:
For each service we have to define a new object with all the characteristics a real
one would have. This is the typical structure of a service attribute:
Group                       Service Attributes
Time_period_1               day, month, year, date
Time_period_2               hour, minute, time
ListGeography               location, zipcode, city, streetname, streetnumber, country
ListJourney                 (city, city), (location, location), (zipcode, zipcode)
Rating                      rating value
Distance                    distance value
Characterizing Parameters   mixed parameters

Table 5.3: List of Groups for the Service Mart Generation
class ServiceAttribute:
    var Id: String                          = the Id of the service
    var DataTemp: dataTemplate.Value        = the type of data expected
    var Semantics: SemanticAnnotation       = the semantic annotation (synsets + domains)
    var Typ: Type.Value                     = the expected format for the data, i.e. INT, STRING...
    var mandatory: Boolean                  = a quality of the attribute
    var AttributeDirection: Direction.Value = the direction of the attribute (IN, OUT, IN_OUT)

And this is an example of an artificial service attribute indicating the "day" concept:

new ServiceAttribute(R.nextInt.toString, dataTemplate.day,
    new SemanticAnnotation(List("14297391"), List("time_period")),
    Type.STRING, true, Direction.IN)
We then categorized all the fictitious services in different groups based on their
semantic values (see Table 5.3).
These groups are then randomly used to build different access patterns and
assemble the service marts. A special consideration has to be made for the "Charac-
terizing Parameters". These are the parameters that characterize, as their name
says, the meaning of the access pattern. They are not generic parameters, such as
space and time, but proper names of persons, titles of books or albums, or any other
specific quality a user may want to look for, such as the genre of a restaurant or the name
of a team. These parameters are thus the core of the generation of the fictitious
service marts.
The final corpus of service marts is completely semantically annotated, as can be
seen in the sample service mart extracted from the corpus in the schema
below.
Figure 5.9: A sample of a generated service mart data structure
5.3.5 Map sub-queries to access patterns
The last step in our analysis is to match sub-queries to appropriate service marts
that can satisfy their search requests. To do this we implemented a complex com-
parison algorithm. As said earlier, we extract from each part a DataType object
which contains different data types, identified either by the regular expressions or
by the entity recognizer.
The matching happens specifically between the available service attributes con-
tained in the access patterns and the data from the sub-queries. If the matching
is satisfied, the service mart from which the access pattern is selected becomes a
suitable candidate for the final search. All the candidate service marts are sorted
according to the domain matching score they earned during the mapping of the
sub-queries (see section 4.8).
The mapping For each type we implemented a dedicated mapping function. Each
type is treated separately, and the mapping schema in the figure below is applied
to each DataType/AccessPattern combination. The output of the function is a list
of results containing all the data that will be needed for the subsequent invocation
of the search service: the service id, the data, and the requested type format. The
updated DataType, with all the elements still available for matching, is also given
as output so that it can be the input to the following calls to the mapping function.
Figure 5.10: Mapping schema
Since not every type is perfectly recognizable with the tools we used, we decided
to implement an enhanced comparison. We compare the attributes not only with
their correspondents of the same type but also, as a backup matching,
with more generic types that can be a suitable match. For instance, if the service
attribute needs a "Price" type and we can't find one in the DataType structure
extracted from the sub-query, we look for a simple "number" type.
The following is a detailed summary of every data type comparison and its
backups:
• Number and Word are compared only with their corresponding types.
• Price: compared with the Price type. If not matched, it is compared
with the Number type.
• Date: the Date structure is formed by day/month/year. The comparison
happens between Date types. If Day or Month types are requested singly
by the services, the match is searched for inside a Date structure.
• Time: the Time structure is formed by hour/minute. The comparison hap-
pens between Time types. If Hour types are requested singly by the
services, the match is searched for inside a Time structure.
• Organization, Location, Person, Title: these types are compared with their
respective types in the DataType structure. If no match is found,
they are compared with the Word type using a semantic matching.
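The backup comparison can be sketched as follows (Python for illustration; the fallback table and function names are hypothetical stand-ins for the rules above):

```python
# Hypothetical fallback table: each specific type and the more generic
# type it is compared against when no exact match is found.
FALLBACK = {
    "price": "number",
    "organization": "word",
    "location": "word",
    "person": "word",
    "title": "word",
}


def match_attribute(attr_type, datatype):
    """Pick a value for a service attribute from the extracted DataType
    dict (type name -> list of values), consuming the matched value so
    that later calls cannot reuse it."""
    for t in (attr_type, FALLBACK.get(attr_type)):
        if t and datatype.get(t):
            return datatype[t].pop(0)
    return None
```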
Chapter 6
Evaluation
In this chapter we’ll evaluate the results extracted from the entries examined in
the Sift application.
6.1 Creation of the corpus of queries and service
marts
We retrieved a great number of entries from Yahoo! Answers to populate our
database and give us a large corpus on which to test our analyses. Over a period of
time we collected close to 1200 entries, though 10% of them had to be discarded
because they were completely unsuitable for our purposes. The corpus thus gives a
good idea of the elements found within Yahoo's database.
The service mart corpus we created consists of approximately 70 artificial marts
assembled with the algorithm described in section 5.3.4. The annotations of these
service marts cover a great number of domains from the WordNet Domain Hierarchy,
so that the matching with words extracted from the sub-entries can be more complete.
CHAPTER 6. EVALUATION 73
The total number of access patterns is more than 400, and we assign between 5 and
8 of them to every service mart. Each access pattern is characterized by a maximum
of 7 completely annotated service attributes.
6.2 The Experiment
After the creation of the environment and the definition of all the algorithms and
analyses to test the queries, we rated every aspect manually according to the criteria
already defined in chapter 4. We evaluated the entries and the data extracted from
them with a 3-star rating system in three different categories: first we rated the
general quality of the entry, then the splitting applications and finally the domain
extraction. We also collected some data on the number of sub-entries extracted, the
number of domains retrieved and the service marts matched.
These ratings are automatically organized into statistics by the program so that we
can more easily analyze the results obtained.
6.3 The results and the Evaluation
6.3.1 Entries evaluation
Figure 6.1: The Main Screen of the Sift Application
Most of the entries from Yahoo! Answers we rated had a very low score and thus
did not correspond to our needs. A lot of questions had to be pruned
because they contained special characters, misspelled words or other nonsense
punctuation (e.g. "Want 2 find an hotel in tokio—-ASAP!!!!!!!PLS!"). The num-
ber of real multi-domain queries we extracted from our source is very low, and
most of those entries are not suitable for our analysis because they don't contain
all the data required to successfully invoke a service mart. The main reason is that
a lot of parameters are implicit in a question; take for example this sub-entry:
"I want to find an hotel in Fiji for me and my son for the next week"
A human reading that question can extract some useful information, such as:
• How many persons are interested in the hotel? parent + son = 2
• What is the date of the vacation? next week = date of this week + 7 days
This data is available in the question, but due to the limited power of our extraction
methods we cannot identify it. That is why many real multi-domain questions
will not find a matching service mart through our algorithm.
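The “next week” case above could in principle be resolved with a small amount of date arithmetic. The sketch below is only an illustration of the idea (the function name and the handful of recognized expressions are invented, not part of our tool):

```python
from datetime import date, timedelta
from typing import Optional

def resolve_relative_date(expression: str, today: date) -> Optional[date]:
    """Resolve a few relative date expressions to an absolute date."""
    expression = expression.lower().strip()
    if expression == "next week":
        return today + timedelta(days=7)
    if expression == "tomorrow":
        return today + timedelta(days=1)
    return None  # expression not recognized

print(resolve_relative_date("next week", date(2010, 3, 22)))  # 2010-03-29
```

A production extractor would need many more patterns (weekday names, “in N days”, holidays), which is exactly the kind of implicit information our current methods cannot recover.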
Moreover, we have to consider the nature of Yahoo! Answers itself. The service
was originally created to let people ask questions to other people, not specifically
multi-domain ones. In fact most of the questions we found asked for advice or
opinions about something, the sort of things one would ask only another human
being and never an automatic online service; for example “How’s the hotel X? I
want to go there with my kids of 3 and 5 years, will it be a good choice?” or “I
heard bad things about that neighborhood, is it really bad? I’m moving there next
week”. Both of these questions are completely unanswerable by any automatic
service; it is even possible that the user who asked them wants to collect multiple
opinions on the matter and then decide for himself, a completely different approach
from what we expect from a user of our multi-domain search service.
Another downside of using Yahoo! Answers is that the form of the site lets the
user give a “title” and then a longer, more specific explanation of the request. We
decided to retrieve only the title section because of the useless elements, such as
links and attachments, that can be found in the “text” section. This is one of the
reasons for the great number of low-rated entries: often the “title”, even when
comprehensible and well structured, refers only to the main topic of a request
whose full text may be multi-domain.
We retrieved approximately 1200 entries, but we had to discard some of them
for the reasons stated above, finally obtaining 1064 entries. 759 of them were
rated as one star, which means that they were completely inappropriate for the
purposes of this research: they were either formally wrong or involved only a single
domain. We then retrieved 221 two-star rated questions, which can be considered
multi-domain but are not compatible with existing web services and are thus of
low value. The three-star rated entries were only 84, still a sufficient number to
test our algorithm and analysis.
We then evaluated the entries in every aspect according to the criteria presented in
Chapter 4. The evaluation of the splitting and of the domain retrieval was done
only for the entries rated with two or three stars, because it would have been
meaningless to evaluate the splitting of a single-domain question.
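The rating distribution just reported can be summarized in a few lines of code; the counts are the ones stated above:

```python
# Rating distribution of the 1064 retained Yahoo! Answers entries.
counts = {"1 star": 759, "2 stars": 221, "3 stars": 84}

total = sum(counts.values())
assert total == 1064  # sanity check against the reported total

for rating, n in counts.items():
    print(f"{rating}: {n} entries ({100 * n / total:.1f}%)")
```

Roughly seven entries out of ten were thus unusable for multi-domain evaluation, which quantifies how poorly suited a general Q&A site is as a source of multi-domain queries.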
6.3.2 Splitting Evaluation
Figure 6.2: First Level Split Statistics
Figure 6.3: Clause Level Split Statistics
From the data we extracted we can see that the “clause level split” ratings are
better than the “first level split” ones. This can be explained by a simple reflection
on the tree structure of a typical English sentence. The most common structure of
a standard question or statement is subject-verb-object, and a complement is very
likely to be attached to the last element of the principal clause, that is, the object.
The same holds for complex sentences: the principal clause sits at the beginning of
the sentence and is followed by coordinate and subordinate clauses in the right part
of the tree. This layout often unbalances the tree and makes it lean to the right,
which can severely affect the first level split: it tends to separate just the first
clause from the rest of the sentence, as seen in the example below.
Figure 6.4: Wrong First level Splitting
As for the number of keywords retrieved from the sub-entries, we discovered that
approximately 90% of them have fewer than 10 keywords from which we can
extract domains in the subsequent tasks. The very low number of extracted
keywords in some entries can limit the information extraction, but this is a limit
of the approach we decided to use. In fact, since domains cannot be extracted
from adverbs and adjectives (they are not in the WordNet Domains database), we
had to discard them and keep only nouns and verbs.
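The noun-and-verb filtering step can be sketched as follows; the hand-made POS tags below are illustrative, not the output of our actual tagger:

```python
# Keep only nouns and verbs as keyword candidates, since WordNet Domains
# provides no domain labels for adverbs and adjectives.
tagged = [("want", "VERB"), ("quickly", "ADV"), ("nice", "ADJ"),
          ("hotel", "NOUN"), ("find", "VERB"), ("Tokyo", "NOUN")]

keywords = [word for word, pos in tagged if pos in ("NOUN", "VERB")]
print(keywords)  # ['want', 'hotel', 'find', 'Tokyo']
```

Adjectives such as “nice” are dropped even though they may carry useful information (e.g. quality requirements), which is one of the losses discussed above.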
6.3.3 Domain Extraction Evaluation
Figure 6.5: WordNet Domain Statistics
The domain extraction method that uses the WordNet Domains database is effective
in retrieving all the possible domains a word can relate to, although the results
usually include far too many generic entries, obtained from the secondary senses
of the words. The great number of domains retrieved for each sub-entry (as we
can see in the statistics screenshot) compels us to sort them efficiently, so as to
find the most relevant ones and subsequently a proper matching. The main problem
in this section is therefore the sorting of the domains. We used the tf-idf method
as the starting point of our research and obtained the results listed in Fig. 6.5.
We then applied the bonus method described in Section 4.5.1 to improve the
domain scores, and the results show that the ratings of the domain sorting were
slightly better than in the previous run. The following table presents a summary
of the statistics:
Figure 6.6: WordNet Domain Statistics Optimized
Rating    Tf-Idf    Optimized
1 Star    142       110
2 Stars   133       134
3 Stars   23        54

Table 6.1: Domain Extraction Results summary
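The tf-idf scoring used as our starting point can be sketched as follows. This is the textbook formulation, not a transcription of our implementation, and the sample counts are invented:

```python
import math

def tf_idf(term_count: int, total_terms: int,
           num_docs: int, docs_with_term: int) -> float:
    """Classic tf-idf: term frequency times inverse document frequency."""
    tf = term_count / total_terms
    idf = math.log(num_docs / docs_with_term)
    return tf * idf

# A domain appearing twice among 8 candidate domains of a sub-entry,
# and present in 5 out of 1064 entries overall (invented figures).
score = tf_idf(term_count=2, total_terms=8, num_docs=1064, docs_with_term=5)
print(round(score, 3))  # 1.34
```

Rare domains thus score high when they recur inside a sub-entry, while ubiquitous generic domains (e.g. “factotum”) are pushed down, which is the behavior we need when pruning secondary senses.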
The main problem found in this section is the small and unbalanced corpus of
domains available in the database. The WordNet Domains Hierarchy contains
fewer than 200 domains (the complete structure can be found in Appendix III), a
meager number compared to the varied annotation needs of entries and services.
Moreover, the coverage of domains is very unbalanced: for example, the “tourism”
domain is categorized under the label “social sciences” and is not detailed in any
way, whereas the “sport” domain is divided into 29 subcategories covering every
possible sport discipline. This can really affect the annotation of an entry or a
service; for instance, a touristic service can count on only one domain in its
annotation, which greatly reduces its semantic potential in the matching process.
6.4 Service Mart Matching Evaluation
Figure 6.7: Service Mart Matching Statistics
From the auto-generated statistics we can see that 726 sub-entries have at least
one service mart that matches their data. This result can be considered successful
for the technical aspects of our matching algorithm: a good number of queries is
successfully matched to a service mart that is therefore invokable with every
required service parameter. Despite the correctness of the algorithm, we cannot
say anything about the semantic correctness and effectiveness of the matching,
since we do not have actual results but only the name and description of a
fictitious service mart.
6.5 A Complete Example of Information Extraction, Splitting and Matching
To properly examine every detail of the analysis and processing of an entry, we
chose a multi-domain question that gave good results in almost every section.
The original question in input:
Do you know a five star rated Thai or Japanese restaurant in LA on 03/23/2010
night? Then a near movie theater with an American five star rated comedy
movie? after I want 3 nights in a five star Malibu hotel from 03/22/2010 to
03/24/2010?
This is the splitting in clauses for the entry.
Figure 6.8: The Clause split of the sample entry
Below you can see the trees of the entry. Every sub-entry is identified by a
different tree; this is due to the punctuation and structure of the entry, which is
composed of three questions.
Figure 6.9: The trees of the clause split division
These are the data types extracted from each sub-entry:

Data Type    Sub-Entry I                Sub-Entry II                Sub-Entry III
Dates        23/03/2010                 --                          22/03/2010, 24/03/2010
Words        star, Thai, Japanese,      movie, theater, American,   nights, star, hotel
             restaurant, LA, night      star, comedy, movie
Locations    Japanese, LA               --                          Malibu
Numbers      --                         --                          3

Table 6.2: The Data Types extracted

These are the matching results for each sub-entry:
Figure 6.10: Matching for the sub-entries
In Fig. 6.10 you can see the results of the matching algorithm. For each sub-entry
we selected the service marts with the highest matching score in terms of domain
affinity. Then we checked for the availability of their required parameters in the
sub-entries and chose the service mart with the highest score that is also
invokable. In the results we thus indicated the sub-entry (in bold black), the
service mart Id and its domain matching score, the matched AP Id (at the bottom),
and all the service attribute Ids with the corresponding data from the sub-entry.
We manually checked the type requirements of every service mart against the
types extracted from the sub-entries, and we found that in the considered case
they corresponded completely, meaning the algorithm found a perfect match and
did not have to resort to the backup matching described in Section 5.3.5.
In the following screen we can see the Ids and scores of the service marts invokable
by the sub-entries. On the left, under each sub-entry, are the service mart Ids;
on the right, in green, the domain matching score for each of them.
Figure 6.11: Service Mart matching for the sub-entries
Chapter 7
Conclusions
7.1 Objectives and Final Evaluation
The objective of this thesis project was the research and creation of a matching
service that could help pair natural language queries with the most suitable
search services, a long and tedious task if done by hand by a single user. In the
research process we first had to enhance and enrich the analysis environment and
implement some automatic statistics tools. Using them, we validated some
techniques previously used to split entries and retrieve domains, and we researched,
tested and evaluated some new approaches to domain extraction and sorting.
Then we extracted, through the use of different tools, the data information from
the sub-queries. Combining the data and domain information, we then developed
a new task for the matching with services. Finally we presented the results obtained
and validated the approaches used. The final application presents the complete
process of acquisition, analysis and matching of the entries.
The results obtained from the program indicate that the approaches used were
quite successful in their technical aspects, as explained in the evaluation section,
and form a strong base for future testing and development of the tasks researched.
7.2 Future Works
Natural language elaboration and analysis is a very tricky and difficult process.
Even though we successfully tried some algorithms and analyses in this field,
we can be sure there is still a lot of work to be done to make this analysis tool
completely effective. The analysis of the input and its matching to suitable
search services will be the core activity of the SeCo project application, and with
this thesis we structured an initial, complete approach to the problem that can be
developed and extended.
Splitting This option has been explored through the Stanford Parser approach;
we only considered the structure of the sentence and never the meaning of each
clause. More in-depth research can be done in this field, exploiting the typical
structures and patterns of questions to split the entries more effectively.
Domain Extraction In this section the main downside of the tool we used is the
scarcity and little variety of domains. One possible solution is to enhance the
available domain database in some way, because the reannotation of the entire
WordNet database would be too ambitious a task, if not impossible.
Information Extraction This section can be enhanced with research on richer
regular expressions, or entity extraction tools can be tried on top of the existing
program structure. In particular, more advanced entity extraction tools can be
useful for identifying different kinds of subjects in the sub-entry, which can help
in the domain retrieval section.
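As a small illustration of what “richer regular expressions” could mean, here are two simple patterns for dates and bare counts (invented, and far simpler than what a production extractor would need):

```python
import re

text = "I want 3 nights in a five star Malibu hotel from 03/22/2010 to 03/24/2010"

# dd/mm/yyyy or mm/dd/yyyy style dates; no validation of day/month ranges.
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)

# Bare integers that are not part of a date (no adjacent digit or slash).
numbers = re.findall(r"(?<![\d/])\b\d+\b(?![\d/])", text)

print(dates)    # ['03/22/2010', '03/24/2010']
print(numbers)  # ['3']
```

The lookarounds in the second pattern keep the date components from being re-extracted as plain numbers; richer patterns along these lines could also cover spelled-out numbers such as “five”, which the sketch above misses.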
Service Mart Matching The algorithm proposed in this thesis project is complete,
although basic. Other approaches to the problem can be tried, for instance data
mining techniques, if we obtain a good number of annotated services and queries.
The main problem at the moment is the absence of real, testable services that
could give us real results on which to base our evaluation. Also, the number of
multi-domain questions retrieved is too small for thorough and complete testing,
so in the future we will need a larger corpus to validate the techniques used.
Chapter 8
Appendix
Appendix I - Glossary
Entry/Query - The question entered in the system manually or retrieved from
Yahoo! Answers.
Sub-Entry/Sub-Query - The various sections of the sentence obtained after the
splitting process.
Domain - Conceptual entity that defines the meaning of a word. The only domains
used are the ones belonging to the WordNet Domains Hierarchy.
Multi-Domain - Referred to a query that involves more than one domain in its
subject.
Appendix II - ServiceMart XML
<ServiceMart>
  <!-- Mart's global properties -->
  <Id>-976403944</Id>
  <Name/>
  <Description/>
  <SemanticAnnotation>
    <!-- List of synsets and domains -->
    <synsets>
      <li>22625</li>
      <li>14297391</li>
      <li>14348156</li>
      <li>5943480</li>
      <li>4167561</li>
      <li>4836174</li>
      <li>14301432</li>
      <li>5950505</li>
      <li>4247355</li>
      <li>6205452</li>
      <li>14343019</li>
      <li>14366717</li>
      <li>14373571</li>
      <li>4807180</li>
      <li>5403518</li>
      <li>8005407</li>
      <li>8023668</li>
    </synsets>
    <domains>
      <li>art</li>
      <li>cinema</li>
      <li>geography</li>
      <li>quality</li>
      <li>theatre</li>
      <li>time_period</li>
      <li>tv</li>
    </domains>
  </SemanticAnnotation>
  <!-- The first AccessPattern -->
  <SMPattern>
    <AccessPattern>
      <Id>-1390431360</Id>
      <Name/>
      <Description/>
      <!-- List of Attributes of the AP -->
      <ServiceAttribute>
        <Id>-1873967594</Id>
        <DataTemp>13</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>5950505</li>
            <li>6205452</li>
          </synsets>
          <domains>
            <li>art</li>
            <li>tv</li>
            <li>cinema</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>626178806</Id>
        <DataTemp>13</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>4247355</li>
            <li>6205452</li>
          </synsets>
          <domains>
            <li>art</li>
            <li>theatre</li>
            <li>cinema</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-214530752</Id>
        <DataTemp>3</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14343019</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-491889153</Id>
        <DataTemp>8</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14366717</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>1405824465</Id>
        <DataTemp>9</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14373571</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-87561533</Id>
        <DataTemp>5</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>4807180</li>
          </synsets>
          <domains>
            <li>geography</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-1925625980</Id>
        <DataTemp>6</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>8023668</li>
          </synsets>
          <domains>
            <li>geography</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-45290849</Id>
        <DataTemp>6</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>22625</li>
          </synsets>
          <domains>
            <li>geography</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
    </AccessPattern>
  </SMPattern>
  <AP>
    <AccessPattern>
      <Id>-1390431360</Id>
      <Name/>
      <Description/>
      <ServiceAttribute>
        <Id>-1873967594</Id>
        <DataTemp>13</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>5950505</li>
            <li>6205452</li>
          </synsets>
          <domains>
            <li>art</li>
            <li>tv</li>
            <li>cinema</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>626178806</Id>
        <DataTemp>13</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>4247355</li>
            <li>6205452</li>
          </synsets>
          <domains>
            <li>art</li>
            <li>theatre</li>
            <li>cinema</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-214530752</Id>
        <DataTemp>3</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14343019</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-491889153</Id>
        <DataTemp>8</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14366717</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
    </AccessPattern>
  </AP>
</ServiceMart>
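A service mart descriptor of this shape can be inspected with a few lines of standard-library code. The sketch below inlines a trimmed toy descriptor so it is self-contained; the element names match the appendix, but the snippet is an illustration, not part of our application:

```python
import xml.etree.ElementTree as ET

# A trimmed descriptor with the same shape as the ServiceMart XML above.
xml_doc = """
<ServiceMart>
  <Id>-976403944</Id>
  <SemanticAnnotation>
    <domains>
      <li>art</li>
      <li>cinema</li>
      <li>geography</li>
    </domains>
  </SemanticAnnotation>
</ServiceMart>
"""

root = ET.fromstring(xml_doc)
# Read the mart's global domain annotations.
domains = [li.text for li in root.find("SemanticAnnotation/domains")]
print(domains)  # ['art', 'cinema', 'geography']
```

The matcher essentially compares such domain lists (and the per-attribute ones nested inside each AccessPattern) against the domains extracted from a sub-entry.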
Appendix III - WordNet Hierarchy
TOP LEVEL
doctrines
free_time
applied_science
pure_science
social_science
factotum
number
color
time_period
person
quality
metrology
HIERARCHY: DOCTRINES
archaeology
astrology
history
• heraldry
linguistics
• grammar
literature
• philology
philosophy
psychology
• psychoanalysis
art
• dance
• drawing
– painting
– philately
• music
• photography
• plastic_arts
– jewellery
– numismatics
– sculpture
• theatre
religion
• mythology
• occultism
• theology
HIERARCHY: FREE_TIME
free_time
play
• betting
• card
• chess
sport
• badminton
• baseball
• basketball
• cricket
• football
• golf
• rugby
• soccer
• table_tennis
• tennis
• volleyball
• cycling
• skating
• skiing
• hockey
• mountaineering
• rowing
• swimming
• sub
• diving
• athletics
• wrestling
• boxing
• fencing
• archery
• fishing
• hunting
• bowling
• racing
HIERARCHY: APPLIED_SCIENCE
applied_science
agriculture
alimentation
• gastronomy
architecture
• town_planning
• building_industry
• furniture
computer_science
engineering
• mechanics
– astronautics
– electrotechnics
– hydraulics
medicine
• dentistry
• pharmacy
• psychiatry
• radiology
• surgery
veterinary
• zootechnics
HIERARCHY: PURE_SCIENCE
astronomy
• topography
biology
• biochemistry
• ecology
• botany
zoology
• entomology
anatomy
physiology
genetics
chemistry
earth
• geology
• meteorology
• oceanography
• paleontology
• geography
mathematics
• geometry
physics
• acoustics
• atomic_physic
• electricity
• optics
HIERARCHY: SOCIAL_SCIENCE
administration
anthropology
• ethnology
– folklore
artisanship
body_care
commerce
economy
• banking
• book_keeping
• enterprise
• exchange
• insurance
• money
• tax
fashion
industry
law
• state
military
pedagogy
• school
• university
politics
• diplomacy
publishing
sexuality
sociology
telecommunication
• cinema
• post
• radio
• telegraphy
• telephony
• tv
tourism
transport
• aeronautic
• auto
• merchant_navy
• railway
Bibliography
[1] Luisa Bentivogli, Pamela Forner, Bernardo Magnini and Emanuele Pianta.
"Revising WordNet Domains Hierarchy: Semantics, Coverage, and Balanc-
ing". In Proceedings of COLING 2004 Workshop on "Multilingual Linguis-
tic Resources", Geneva, Switzerland, August 28, 2004, pp. 101-108.
[2] Jenny Rose Finkel, Trond Grenager and Christopher Manning. Incorporating
non-local information into information extraction systems by Gibbs sampling.
In Proceedings of the 43rd Annual Meeting of the Association for Computational
Linguistics, pp. 363-370, June 25-30, 2005, Ann Arbor, Michigan.
[3] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003
shared task: language-independent named entity recognition. In Proceedings of
the Seventh Conference on Natural Language Learning at HLT-NAACL 2003,
pp. 142-147, May 31, 2003, Edmonton, Canada.
[4] R. Mandala, T. Takenobu, and T. Hozumi. The use of WordNet in information
retrieval. In COLING/ACL Workshop on Usage of WordNet in Natural
Language Processing, Systert, 1998.
[5] Magnini, B. and C. Strapparava. 2000. Experiments in word domain disam-
biguation for parallel texts. In ACL-2000 Workshop on Word Sense and Mul-
tilinguality. Association for Computational Linguistics, New Brunswick, NJ.
[6] Alfio Gliozzo, Bernardo Magnini and Carlo Strapparava. Unsupervised Domain
Relevance Estimation for Word Sense Disambiguation. ITC-irst, Istituto per la
Ricerca Scientifica e Tecnologica, I-38050 Trento, Italy.
[7] B. Magnini, C. Strapparava, G. Pezzulo, A. Gliozzo. "The Role of Domain
Information in Word Sense Disambiguation", Natural Language Engineer-
ing, special issue on Word Sense Disambiguation, 8(4), pp. 359-373, Cam-
bridge University Press, 2002
[8] E. Minkov, R. C. Wang, and W. W. Cohen. Extracting personal names
from emails: Applying named entity recognition to informal text. In HLT-
EMNLP, 2005.
[9] Li, Y.; Moffat, A.; Stokes, N. & Cavedon, L. Exploring Probabilistic To-
ponym Resolution for Geographical Information Retrieval. In 3rd Workshop
on Geographic Information Retrieval (GIR 2006). Seattle, WA,USA, 2006.
17–22.
[10] Karen Spärck Jones. A statistical interpretation of term specificity and its
application in retrieval. Journal of Documentation, 28:11–21, 1972.
[11] Fox, E.A., Neves, F.D., Yu, X., Shen, R., Kim, S. and Fan, W. Exploring
the computing literature with visualization and stepping stones & pathways.
CACM 49(4): 52-58, 2006.
[12] C. Zhang, N. Sun, X. Hu, T. Huang, and T.-S. Chua. Query segmentation
based on eigenspace similarity. In Proceedings of the ACL-IJCNLP 2009
Conference, pages 185–188, Suntec, Singapore, August 2009.
[13] Bin Tan and Fuchun Peng. Unsupervised query segmentation using generative
language models and Wikipedia. In Proceedings of the 17th International
Conference on World Wide Web, April 21-25, 2008, Beijing, China.