8/2/2019 Matching Natural Language Multi Domain Queries to Search Service
http://slidepdf.com/reader/full/matching-natural-language-multi-domain-queries-to-search-service 1/96
POLITECNICO DI MILANO
FACULTY OF ENGINEERING
MASTER OF SCIENCE PROGRAMME IN COMPUTER ENGINEERING

MATCHING NATURAL LANGUAGE MULTI-DOMAIN QUERIES TO SEARCH SERVICES

Advisor: Ing. Marco BRAMBILLA
Co-advisor: Prof. Stefano CERI

Master's thesis by:
Claudia Farè
Student ID 721154

ACADEMIC YEAR 2008-2009
The computer was born to solve problems that did not exist before.
Bill Gates
Contents
1 Introduction
1.1 Context
1.2 The Problem
1.3 Objective

2 Background Work
2.1 SeCo, beyond Page Search
2.2 The General Architecture
2.2.1 The registration flow
2.2.2 The query execution flow
2.2.2.1 Query analysis
2.2.2.2 Query to domain and service mapping
2.2.2.3 Query Planner
2.2.2.4 Query engine
2.2.2.5 Result transformation and Interfaces
2.3 Service Marts
2.4 The Natural Language Framework
2.5 WordNet
2.6 WordNet Domains
2.7 Stanford Parser
2.8 Named Entity Recognition
2.9 Technologies
3 Related Work
3.1 WordNet
3.2 WordNet Domains
3.3 Named Entity Recognizer
3.4 Query splitting
3.5 Matching
4 The Thesis Project Contribution
4.1 Objective
4.2 Hypothesis
4.3 Query Analysis
4.3.1 The parsing
4.3.2 The splitting
4.4 The Extraction of the Data Types
4.5 Mapping to domains
4.5.1 Methods to improve the domain score
4.6 The Service Mart Repository
4.7 Map sub-queries to Service Marts
4.8 Map sub-queries to Access Patterns
4.8.1 The semantic name matching
4.8.2 Evaluation Criteria and Statistics
5 Implementation
5.1 The system general architecture
5.2 The Sift Application
5.2.1 Bee - Distributed Background Processing
5.3 Procedures
5.3.1 Parsing
5.3.2 Sentence Splitting Strategies
5.3.3 Information extraction
5.3.4 Service Mart Semi-Automatic Generation
5.3.5 Map sub-queries to access patterns
6 Evaluation
6.1 Creation of the corpus of queries and service marts
6.2 The Experiment
6.3 The results and the Evaluation
6.3.1 Entries evaluation
6.3.2 Splitting Evaluation
6.3.3 Domain Extraction Evaluation
6.4 Service Mart Matching Evaluation
6.5 A complete example of info extraction, splitting and matching

7 Conclusions
7.1 Objectives and Final Evaluation
7.2 Future Works

8 Appendix
List of Figures
2.1 The overall architecture of the system, together with the two main execution flows
2.2 The research flows for the Natural Language Framework
2.3 The sample pie chart
4.1 The trees retrieved from the analysis of the query
4.2 The tree for the correct example
4.3 The semantic modelization of the Service Mart
4.4 The WordNet Domains Hierarchy
4.5 Example of sub-query/AP matching
4.6 Example statistics for first level split
4.7 Example statistics for clause split
4.8 Example statistics for domains extraction
5.1 The Architecture schema
5.2 The models
5.3 Screen of the Sift application
5.4 The Bee Structure
5.5 The algorithm schema after the splitting
5.6 The Information Extraction Flow
5.7 Domain extraction algorithm structure
5.8 A graphical sample of a substructure of the WordNet hierarchy
5.9 A sample of a generated service mart data structure
5.10 Mapping schema
6.1 The Main Screen of the Sift Application
6.2 First Level Split Statistics
6.3 Clause Level Split Statistics
6.4 Wrong First Level Splitting
6.5 WordNet Domain Statistics
6.6 WordNet Domain Statistics Optimized
6.7 Service Mart Matching Statistics
6.8 The clause split of the sample entry
6.9 The trees of the clause split division
6.10 Matching for the sub-entries
6.11 Service Mart matching for the sub-entries
List of Tables
5.1 The task interface
5.2 Bonus Refinement example
5.3 List of Groups for the Service Mart Generation
6.1 Domain Extraction Results summary
6.2 The Data Types extracted
Chapter 1
Introduction
1.1 Context
In recent years, much research effort in information retrieval has been devoted to full-text search and document indexing. The main fruits of these efforts are the general-purpose search engines that everyone uses, such as Yahoo!™ and Google™; the latter has even become a verb in the English language, given the popularity of the term. These engines let us retrieve any document available on the web about the topic we are searching for. If the World Wide Web began the democratization of information availability, these search engines carried it to its peak. However, this simple but broad kind of search brings some limitations. Users no longer want to look for generic documents about a topic; they want answers to specific questions, as if the search engine were a human being that understood their needs and satisfied them. To find an answer with a general-purpose engine such as Google™, users usually have to hope that someone has already asked the same question in an indexed document, or read through many documents hoping to find what they were looking for. Much research has explored this field, and one notable effort is represented by knowledge-based search systems. These systems let the user ask a specific question against a knowledge base built on large ontologies that can select the right answers. This works very well for "non-changing" information such as technical, mathematical, geographical, and physics questions, but it is unreliable for ever-changing data like news and events. Moreover, each request is restricted to a single domain: only a specific question about one topic at a time can be asked. The goal of future research is therefore to lift the limitation of single-domain questions and to provide results not only about precise facts, but also for questions whose answers span several domains, with possible rankings based on features. For example, the question "I want a cheap Chinese restaurant near piazza Duomo in Milan" involves two domains, "place" and "Chinese restaurants", and requires a ranking based on price. In recent years, web services have also grown in popularity. These services offer a software interface that allows other systems to interact with them through the HTTP protocol. The proliferation of open and accessible web search services has allowed the world to access, aggregate, and mix data in previously unthought-of ways. From these premises the SeCo project at Politecnico di Milano was born. The project is currently under active development and aims at building a system that pushes the boundaries of current search engines.
1.2 The Problem
Although many advances have been made in the theoretical and formal aspects of distributing multi-domain queries and merging back the results, much work remains on interfacing the system with the user in the most natural way. Interfaces for such services are usually complex and have to be configured manually, sometimes with a far from user-friendly syntax. Services such as Yahoo!™ or Google™, by contrast, have popularized the simple text box where free text can be entered; the filtering and understanding is left entirely to the service, while the user writes as he would to another understanding entity. This is the main problem that led us to the current research project in the field of query analysis, specifically oriented towards understanding, translating, and matching with the right services the queries made in the SeCo project, which can span more than one domain. Answering multi-domain questions in a non-automated way is a complex and tedious job for users, because they need to coordinate the answers from various services. If we can extract multi-domain information from a single query and match the elements of the query directly with the right services, this will be a great step towards a fully functioning multi-domain search engine.
1.3 Objective
The main objective of this thesis project is to set up an analysis and matching environment for natural language multi-domain queries, and to examine the results retrieved with the tools under experimentation. This environment is based on the Sift application and, through a number of splitting and information extraction tools, allows us to examine the entries and match them to suitable web search services, or service marts. In the information extraction part we translate the input questions from the natural form in which a user would write them into a form the system can understand and act upon. In the matching part we try to match the given queries to suitable service marts, which then become the starting point of the information retrieval process.
Chapter 2
Background Work
2.1 SeCo, beyond Page Search
In the last few years, Internet search has mainly consisted of routing users to the web page that best answers the question they submit. The page search services available online typically fall into three main kinds.

General-purpose search engines, the most popular and widely used, such as Google™ and Bing™, base their searches on relevance and ranking indexes that are updated according to the importance and popularity of each web page. These search engines owe their popularity to their ability to fulfill user needs; however, not all information requests can be satisfied by web pages (the so-called "surface web"). Most of the information available on the Internet lies in the "deep web", an expression that refers to all dynamically generated sites whose content cannot be reached by search engine crawlers.

A second kind of search technology is the knowledge-based search system. These systems base their searches on large, previously built ontologies that select the right answer to a question. With this approach, the wider the ontology, the more effective the results. This method is superior to conventional search for answering queries over well-structured or organized knowledge. The downside is that such wide ontologies require long development times and great effort to keep the knowledge base up to date, with no possibility of adding dynamic or ever-changing data like weekly events or news.
The third approach is the meta-search engine. These engines combine the results for a single-domain request in a way that would take hours for an ordinary user to achieve with generalist engines alone. For instance, a meta-search engine can provide a price-ordered list of flights between two cities in a few seconds, a task that would otherwise require a lengthy visit to airline and travel agency sites. The main downside of these search engines is the single-domain limit.

None of these approaches can reach a multi-domain answer in a single search. Though very effective for a large number of queries, they do not support multi-domain requests: if a multi-domain question is submitted, the result will very likely be unsatisfactory, unless the same combination of multi-domain data happens to exist on a single web page.
The SeCo project aims at creating a multi-domain search system based on web services. The platform pushes the limits of the field of multi-domain queries by formalizing its theoretical aspects as well as addressing it from a software engineering point of view, enabling the construction of a usable search engine that answers arbitrary queries. These queries are analyzed and matched to suitable web services; the results are finally aggregated, and the user can visualize multi-domain results in response to a single request.
2.2 The General Architecture
The SeCo project is divided into higher-level components composed in a service-oriented manner. Within the multi-domain query answering problem, the SeCo architecture can be divided into two main activity flows: the registration flow, which deals with the creation of new domains, domain descriptions, and search services within the framework; and the query execution flow, which deals with the actual enactment of the queries. The main components are the query analysis, the query-to-domain mapper, the query planner, the query engine, and the results transformation. Two frameworks, the service and domain frameworks, are also added as intelligent repositories.
In the query execution flow, a query sent by the user first passes through the query analysis and the query-to-domain mapper, where the different domains and properties are extracted from the natural language query. It then goes to the query planner, which creates an execution plan taking into account the different costs of executing the query, in order to obtain the most efficient execution. The different sub-queries are then sent to the domain and service frameworks, which take care of calling the external services through a Web or messaging interface. The results are then collected and, according to the plan, merged back together. The final results are transformed before being sent back to the user. While the query execution flow covers all the processing from an end user's request to the system's response, the activity in the registration flow mainly concerns the registration of search services by service designers or other developers.
Figure 2.1: The overall architecture of the system, together with the two main
execution flows.
2.2.1 The registration flow
The registration flow comprises all the activities that deal with the registration of new domains, domain descriptions, and search services. This section is explained only briefly, because it does not directly concern the thesis project.

The domain framework deals with domains and their definitions, and addresses the problems of semantic annotation, storage, management, and access to domains and their descriptions. The whole multi-domain search engine is based on the concept of domain, considered as a self-standing field of interest for the user. The domain repository is a data structure that stores domains organized as a taxonomy, representing a tree of domain/sub-domain relationships. Information about the domains can be retrieved by other components through an API.
The search service framework defines a conceptual model of search services and addresses their semantic annotation, storage, management, and access. Its main function is to enable the annotation of the request/response interface of the services. Such annotation uses the WordNet vocabulary and adds labels to each service, its operations, and the input-output parameters of each operation. The framework is concerned only with those operations of a Web service which perform data retrieval, particularly operations which return itemized and ranked information.

The service analyzer addresses the following problems: the clustering of the available services based on their similarity, the mapping of services to domains, and the definition of join connections between services.
2.2.2 The query execution flow
The query execution flow addresses the problems of analyzing the user query, mapping it to domains and services, planning and executing the low-level queries, and transforming the results. Its main components, introduced in the previous section, are the query analysis, the query-to-domain mapper, the query planner, the query engine, and the results transformation; the following sections describe each of them in turn.
2.2.2.1 Query analysis
In this phase, high-level multi-domain user queries are analyzed and split into sub-queries. A high-level query is the specification of a user's information need at a high level of abstraction. It is assumed that high-level queries are quasi-natural-language descriptions of the user's request, which may require extracting information from multiple domains. The query analysis component decomposes the high-level queries into sub-queries, each representing one search objective in a specific domain. For processing the natural language query, an open source tool developed by the Stanford Natural Language Processing Group is used.
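As a toy illustration of this decomposition step, the sketch below splits a multi-domain query into candidate sub-queries at common coordination markers. This is not the thesis's approach, which relies on the Stanford parser's syntactic trees; the marker list and the example query are purely illustrative.

```python
# Toy first-level splitter: break a quasi-natural-language multi-domain query
# into candidate sub-queries at coordinating markers. Illustrative only; the
# real component splits on a parse tree produced by the Stanford parser.

SPLIT_MARKERS = (" and ", " then ", ", ")

def naive_split(query: str) -> list[str]:
    """Split a query at common coordination markers, keeping non-empty parts."""
    parts = [query]
    for marker in SPLIT_MARKERS:
        next_parts = []
        for part in parts:
            next_parts.extend(p.strip() for p in part.split(marker))
        parts = next_parts
    return [p for p in parts if p]

query = "find a cheap Chinese restaurant near piazza Duomo and a hotel close to it"
print(naive_split(query))
```

Each resulting fragment would then be treated as one search objective in a specific domain; a parse-tree-based splitter avoids the obvious failure modes of this surface-level heuristic (e.g. conjunctions inside a single objective).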
2.2.2.2 Query to domain and service mapping
This component addresses the problem of mapping sub-queries to domains, and subsequently to the associated search services, in order to define low-level queries. To successfully map a sub-query to a domain, we need to retrieve for each sub-query a defined subset of similar domains that allows a crisp identification of the sub-query's semantics, which, owing to the use of natural language, can be ambiguous and imprecise. Several techniques can be applied to optimize the recognition of query/sub-query structures that comply with the separation into distinct domains of concern; some of these methods are analyzed, in their meaning and implementation, in the next chapters.
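The mapping idea can be sketched as a scoring problem: each candidate domain is scored against the sub-query, and the highest-scoring domain wins. The sketch below scores by plain keyword overlap; the actual component uses WordNet Domains and several refinement techniques, and the domain lexicons here are invented for illustration.

```python
# Hedged sketch of sub-query -> domain mapping by keyword overlap.
# The domain names and keyword sets are illustrative assumptions, not the
# WordNet Domains hierarchy used by the real system.

DOMAIN_KEYWORDS = {
    "gastronomy": {"restaurant", "food", "dinner", "eat", "chinese"},
    "geography":  {"near", "place", "city", "milan"},
    "tourism":    {"hotel", "flight", "trip", "museum"},
}

def score_domains(sub_query: str) -> dict[str, int]:
    """Score each candidate domain by how many of its keywords appear in the sub-query."""
    tokens = set(sub_query.lower().split())
    return {d: len(tokens & kw) for d, kw in DOMAIN_KEYWORDS.items()}

def best_domain(sub_query: str) -> str:
    """Return the highest-scoring domain for a sub-query."""
    scores = score_domains(sub_query)
    return max(scores, key=scores.get)

print(best_domain("a cheap chinese restaurant near piazza duomo"))
```

Real natural language is ambiguous ("duomo" could suggest architecture as well as place), which is exactly why the thesis layers disambiguation and score-improvement methods on top of a basic scoring scheme like this.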
2.2.2.3 Query Planner
A low-level query is a composite query over a number of services. A query plan is a well-defined scheduling of service invocations, possibly parallelized, that complies with the services' access modes and exploits the ranking order in which search services return results to rank the overall results. The Query Planner addresses the problem of generating query plans and evaluating them against a cost metric, so as to choose the most promising one for execution. It accepts as input low-level queries, i.e. conjunctive queries that list the specific services to be invoked, already chosen by the Query-to-Domain Mapper. It then schedules the invocations of the Web services and the composition of their inputs and outputs. Finally, it progressively refines its choices and produces an access plan by performing the following steps:
1. Since services may be accessed according to different patterns, the Query Planner chooses, for each service involved in the query, a specific access pattern compatible with the query.

2. Once the access patterns are fixed, there may still be some indeterminacy in the order of invocation of the different services, some of which may be invoked in parallel. The Query Planner fixes this order.

3. The main operation for combining search services in our conjunctive setting is the join. The Query Planner selects an execution strategy for each join.

4. Optimality of execution depends primarily on the cost and time of request/responses to services. The Query Planner determines the expected number of requests associated with each service in order to obtain the desired number of results, so as to associate an execution cost with each plan.
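Step 4 can be illustrated with a minimal cost model: if each service has a per-request cost and an expected yield of results per request, the cost of a plan is the sum over its services of cost times the number of requests needed to reach the desired result count. The numbers and plan structures below are invented for illustration, not taken from the thesis.

```python
# Minimal sketch of cost-based plan selection: estimate the expected number of
# requests per service to obtain `wanted` results, sum the costs, pick the
# cheapest plan. Cost figures are illustrative assumptions.
import math

def plan_cost(services: list[tuple[float, float]], wanted: int) -> float:
    """services: list of (cost_per_request, results_per_request) pairs."""
    total = 0.0
    for cost_per_request, results_per_request in services:
        requests = math.ceil(wanted / results_per_request)
        total += cost_per_request * requests
    return total

plan_a = [(1.0, 10), (2.0, 5)]    # two services with different cost/yield
plan_b = [(0.5, 2), (2.0, 20)]
wanted = 20
name, _ = min([("A", plan_a), ("B", plan_b)], key=lambda p: plan_cost(p[1], wanted))
print(name, plan_cost(plan_a, wanted), plan_cost(plan_b, wanted))
```

A real planner must also weigh join strategies and parallelism (steps 2 and 3), but the principle is the same: attach a cost to each candidate plan and execute the cheapest.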
2.2.2.4 Query engine
The query engine deals with the generation and processing of query execution schedules: it takes the low-level plan from the query planner and executes the different service calls in parallel, merging and ordering the results when required. The results and the combinations returned are collected in their "raw" format of tuples of values as they become available, and passed to the Result Transformation module to be processed before being presented to the user.
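The engine's parallel-invocation-and-merge step can be sketched with standard concurrency primitives. The two services below are stand-in functions returning hard-coded ranked tuples; they are illustrative assumptions, not the SeCo service framework.

```python
# Sketch of parallel service invocation followed by a ranked merge.
# The "services" are stand-in functions returning (name, score) tuples.
from concurrent.futures import ThreadPoolExecutor

def restaurant_service():
    return [("Golden Dragon", 4.5), ("Jade Palace", 4.2)]

def hotel_service():
    return [("Hotel Duomo", 4.8), ("Albergo Verdi", 4.0)]

def execute_plan(services):
    """Call all services concurrently, then merge their tuples by score, descending."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(s) for s in services]
        results = [f.result() for f in futures]
    return sorted((row for rs in results for row in rs), key=lambda r: -r[1])

print(execute_plan([restaurant_service, hotel_service]))
```

In the real engine the merge is driven by the plan (join strategies, access modes) rather than a single global sort, but the shape is the same: independent calls run in parallel and their tuples are combined before transformation.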
2.2.2.5 Result transformation and Interfaces
This component is dedicated to defining proper interfaces for the submission of multi-domain user queries and to transforming the results into the format requested by the final user. It deals with building an interface that lets the user express multi-domain queries in a facilitated way, and an interface for presenting results. In the latter, the user can drill down into the result set and understand where each piece of information comes from, enabling query refinement, or can peruse the results of past queries to better reformulate his information need.
2.3 Service Marts
The Service Mart component is an abstraction used to manage the publication of, and access to, the data sources in the Search Computing architecture. The goal of a service mart is to ease the publication of a special class of software services, called search services, whose responses are ranked lists of objects. Every service mart is mapped to one "Web object" available on the Internet; therefore, we may have service marts for "hotels", "flights", "doctors", and so on. Service marts are thus consistent with a view of the "Internet of objects", which is gaining popularity as a new way to reinterpret concept organization on the Web and go beyond the unstructured organization of Web pages.
A Service Mart is a component with a known interface, defined at project time, which manages a collection of similar or semantically correlated services. The Service Mart can invoke these services, presenting itself as a standard interface between a query's request and its result. The underlying complexity can thus be hidden from the higher levels, and the result can be a completely relational model, simplified with respect to the original complexity of the web services model.
A Service Mart is defined by an Id, a Name, and a Description documenting its functionality. It is then divided into different levels of abstraction. The highest level is the Service Mart Signature, which contains a description of the service mart attributes (the sample input and output data that the Mart can handle) and of the repeating groups, non-empty sets of sub-attributes that collectively define a property of the service mart. At the level below are the Access Patterns. Their structure is analogous to the Signature, and each specifies a further possible invocation mode; every parameter in an Access Pattern is characterized by a data type, a "mandatory" flag, and a direction (input or output). At the third and lowest level are the Service Interfaces. A Service Interface is a concrete description of an access pattern: it has an interface with its attributes and is linked to a service implementation, the real link to the web service (to retrieve data from local or remote sources).
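The three description levels can be sketched as plain data structures. The field names follow the text (signature attributes; access-pattern parameters with data type, mandatory flag, and direction; service interfaces linked to an implementation); the classes themselves and the "Hotel" example are illustrative, not the SeCo implementation.

```python
# Illustrative data model for the Service Mart description levels.
from dataclasses import dataclass, field

@dataclass
class Parameter:
    name: str
    data_type: str
    mandatory: bool
    direction: str  # "input" or "output"

@dataclass
class AccessPattern:
    name: str
    parameters: list[Parameter]

@dataclass
class ServiceInterface:
    name: str
    implementation_url: str  # the concrete link to the web service

@dataclass
class ServiceMart:
    id: int
    name: str
    description: str
    signature_attributes: list[str]
    access_patterns: list[AccessPattern] = field(default_factory=list)
    interfaces: list[ServiceInterface] = field(default_factory=list)

hotel = ServiceMart(
    id=1, name="Hotel", description="Searches ranked hotel offers",
    signature_attributes=["city", "name", "price"],
    access_patterns=[AccessPattern("by_city", [
        Parameter("city", "string", True, "input"),
        Parameter("price", "float", False, "output"),
    ])],
)
print(hotel.name, hotel.access_patterns[0].parameters[0].name)
```

Modelled this way, a query can be matched first at the mart level (which concept?) and then at the access-pattern level (which invocation mode fits the available inputs?), which is exactly the two-step matching discussed in Chapter 4.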
Connection patterns represent the coupling of service marts (at the conceptual level) and of service interfaces (at the physical level). Each pattern has a conceptual name and a logical specification, consisting of a sequence of simple comparison predicates between pairs of attributes or sub-attributes of the two services. These predicates are interpreted as a conjunctive Boolean expression, and can therefore be implemented by joining the results returned by calling the service implementations.
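Concretely, a single-predicate connection pattern amounts to an equi-join over the two services' results. The data and attribute names below are invented for illustration.

```python
# Sketch of a connection pattern executed as a join: an equality predicate
# between shared attributes of two result sets. Data is illustrative.

hotels = [
    {"name": "Hotel Duomo", "city": "Milan"},
    {"name": "Hotel Mare", "city": "Genoa"},
]
restaurants = [
    {"name": "Jade Palace", "city": "Milan"},
    {"name": "Trattoria Gino", "city": "Rome"},
]

def join_on(left, right, attr):
    """Join two result lists on equality of a shared attribute."""
    return [(l, r) for l in left for r in right if l[attr] == r[attr]]

pairs = join_on(hotels, restaurants, "city")
print([(h["name"], r["name"]) for h, r in pairs])
```

A pattern with several predicates would simply conjoin the comparisons inside the filter, matching the "conjunctive Boolean expression" reading above.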
Visually, service marts and connection patterns can be presented as resource graphs, where nodes represent marts and undirected arcs represent connection patterns. The model of the web proposed by Search Computing is based on a simplification of reality, seen through potentially very large resource graphs. This visualization links interconnected concepts, supporting the creation of multi-domain queries through ad-hoc user interfaces.
2.4 The Natural Language Framework
The natural language processing framework used as a starting point for the thesis was the fruit of a two-phase research effort, as illustrated in the figure. The main goal of the framework's design was to assemble a complete corpus of queries and analyze them efficiently, so that the retrieved data could become the starting point for testing the SeCo search engine.
Figure 2.2: The research flows for the Natural Language Framework
The aim of the framework is to create an environment to analyze input queries and
extract information about their characteristics and domains; this information is
subsequently used to elaborate a suitable matching with the corresponding search
services.
In the first phase, a corpus that responds to the needs of the project, that is, one
assembling as many multi-domain queries as possible, was created from scratch using
publicly available data. This data was acquired from the publicly available service
Yahoo! Answers, specifically from the tourism question section. This choice
was driven by the fact that in that section it is very likely to find multi-domain
requests, due to the multifaceted subject. In the second phase, a smaller but more
interesting subset of this larger corpus was taken and analyzed in depth. This
analysis has two aspects. The first is the splitting of a question into the diverse
domains that constitute it, extracting the important objects from those parts. The
second is the association of the resulting objects with one or more semantic domains
of knowledge that will be mapped to the corresponding services.
All these analyses are carried out in a web application environment called Sift.
More details are given in the implementation section.
2.5 WordNet
WordNet is a lexical database for the English language that aims at organizing,
defining and describing concepts through a semantic network. The lexicon is
organized by grouping terms with similar meanings into sets called synsets and by
linking their meanings through a number of different relations. The latest available
version of the database (WordNet 3.0) contains more than 150,000 terms organized
in 117,659 synsets. Moreover, given WordNet's success, many lexical networks have
been developed to link WordNet terms to other languages, as a multilingual search
support. Semantic relations available in WordNet are categorized according to the
terms that take part in each specific relation. Among nouns, the principal semantic
relations defined are: hypernyms, hyponyms, holonyms, meronyms. Relations are
also defined among verbs and adjectives.
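The synset-and-relation structure can be pictured with a toy fragment in Ruby (the three synsets and their hypernym links are invented for illustration; the real WordNet data is far richer):

```ruby
# A synset groups terms with the same meaning; hypernym points to the
# more general synset, forming an is-a chain up to the root.
Synset = Struct.new(:terms, :hypernym)

entity = Synset.new(["entity"], nil)
animal = Synset.new(["animal", "creature"], entity)
dog    = Synset.new(["dog", "domestic dog"], animal)

# walk the hypernym chain of a synset up to the root
def hypernym_chain(synset)
  chain = []
  chain << synset.terms.first while (synset = synset.hypernym)
  chain
end

hypernym_chain(dog)  # => ["animal", "entity"]
```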
2.6 WordNet Domains
WordNet Domains is a lexical resource, which can be considered an extension of
WordNet, in which synsets have been annotated semi-automatically with one or more
domain labels. A domain may include synsets of different syntactic categories and
from different WordNet sub-hierarchies [1].
WordNet Domains contains 200 domain labels in a hierarchical structure (the
WordNet Domains Hierarchy), organized as in the Dewey Decimal Classification
(DDC), a general knowledge organization tool which is the most widely used
taxonomy for library organization purposes. Each synset of WordNet 2.0 was
labeled with one or more labels, using a methodology that combines manual and
automatic assignments.
The whole infrastructure of the multi-domain search engine is based on the concept
of domain. A domain is considered a self-standing field of interest for the user,
such as music, sport, arts, tourism, computer science, and so on. The annotation of
every synset in WordNet Domains makes it possible to characterize a domain in
terms of the terms most frequently used for describing concepts in that domain and,
vice versa, to identify for each synset the list of domains it refers to. One of the
most interesting and urgent tasks in Search Computing was to investigate whether
WordNet Domains can facilitate the task of partitioning queries and associating
them with specific search engines and data sources. The domain repository is a
data structure able to store domains as described above. In this solution, we assume
that domains are organized as a taxonomy, representing a tree of domain/sub-domain
relationships. Information about the domains is made available to the other
components through an API that exposes interfaces for querying and updating the
domain structure (i.e., creation, deletion, and update of domain information,
including associated synsets and services).
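A minimal sketch of such a repository in Ruby (method names and the example domains are assumptions, not the project's actual API) could look like this:

```ruby
# Domains form a tree of domain/sub-domain relationships; each node
# carries its associated synsets and services.
class DomainRepository
  Domain = Struct.new(:name, :synsets, :services, :children)

  def initialize
    @root = Domain.new("root", [], [], {})
  end

  # create (or extend) the domain at the given path in the taxonomy
  def create(path, synsets: [], services: [])
    node = @root
    path.each { |name| node = (node.children[name] ||= Domain.new(name, [], [], {})) }
    node.synsets.concat(synsets)
    node.services.concat(services)
    node
  end

  def find(path)
    path.reduce(@root) { |node, name| node && node.children[name] }
  end
end

repo = DomainRepository.new
repo.create(%w[tourism hotel], synsets: ["hotel.n.01"], services: ["HotelSearch"])
repo.find(%w[tourism hotel]).services  # => ["HotelSearch"]
```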
2.7 Stanford Parser
The Stanford Parser is a Natural Language Processing suite of tools and libraries
that can be used in various tasks related to natural language analysis. In the context
of this research it is used for its parsing abilities: it implements a probabilistic
parser of English natural language sentences, packaged as a Java library accompanied
by a dictionary file used as training data. The outcome of the parser is a tree
representation of the sentence, which is suitable for the problem of splitting queries
into sub-queries to be assigned to different domains.
Probabilistic parsing uses dynamic programming algorithms to compute the most
likely parse(s) of a given sentence, given a statistical model of the syntactic
structure of a language. Models have been developed for parsing several languages:
English (used for this research), Chinese, Arabic, and German.
The very detailed parse of a sentence or period makes it possible to try a lot of
different approaches to the splitting and analysis of natural language. In this
framework two main approaches have been researched: first-level splitting and
clause-level splitting. The research and result details about these approaches are
examined in the next chapters.
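To give a flavor of clause-level splitting, the simplified Ruby sketch below (not the framework's actual algorithm) represents a parse tree as nested arrays and extracts the lowest clause subtrees, each of which becomes a candidate sub-query:

```ruby
# A parse tree is a nested array [label, child, ...]; leaves are words.
def leaves(tree)
  tree.is_a?(String) ? [tree] : tree.drop(1).flat_map { |c| leaves(c) }
end

CLAUSE_LABELS = %w[S SBAR].freeze

# Collect the lowest clause subtrees: a clause containing no further
# clause yields one sub-query; otherwise we keep descending.
def clauses(tree)
  return [] if tree.is_a?(String)
  label, *children = tree
  inner = children.flat_map { |c| clauses(c) }
  return inner unless CLAUSE_LABELS.include?(label) && inner.empty?
  [leaves(tree).join(" ")]
end

tree = ["ROOT",
        ["S",
         ["S", ["NP", "I"], ["VP", "want", ["NP", "a", "hotel"]]],
         ["SBAR", "that", ["S", ["VP", "has", ["NP", "a", "spa"]]]]]]
clauses(tree)  # => ["I want a hotel", "has a spa"]
```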
2.8 Named Entity Recognition
The tool we used for Named Entity Recognition (NER) is the CRF (Conditional
Random Field)-based NER system developed by the Stanford NLP Group [2].
Named entity recognition (also known as entity identification or entity extraction)
is a subtask of information extraction that seeks to locate and classify atomic
elements in a text into predefined categories such as the names of persons,
organizations, locations, etc.
Given a text as input, the NER system produces a parsed output that highlights the
entities found in the document.
In particular, the Stanford system can recognize a great number of persons (famous
people or proper names), organizations (companies, government organizations,
committees, etc.), locations (cities, countries, rivers, etc.) and other miscellaneous
entities. This system is trained on the CoNLL-2003 [3] named entity data, which
consists of eight files covering two languages: English and German. The English
data was taken from the Reuters Corpus, which consists of Reuters news stories
published between August 1996 and August 1997.
The CoNLL-2003 data files contain four columns separated by a single space.
Each word is put on a separate line and there is an empty line after each sentence.
The first item on each line is a word, the second a part-of-speech (POS) tag, the
third a syntactic chunk tag and the fourth the named entity tag. The chunk tags
and the named entity tags have the format I-TYPE, which means that the word is
inside a phrase of type TYPE. Only if two phrases of the same type immediately
follow each other does the first word of the second phrase receive the tag B-TYPE,
to show that it starts a new phrase. A word with tag O is not part of a phrase.
Here is an example:
U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
The data consists of three files per language: one training file and two test files
“Test A” and “Test B”. The first test file is used in the development phase for
finding good parameters for the learning system. The second test file is used for
the final evaluation.
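Reading the four-column format is straightforward; this Ruby sketch groups consecutive I-TYPE tokens of the same type into entity spans:

```ruby
# Parse CoNLL-2003 lines (word, POS, chunk, NER tag) and collect
# entity spans from the fourth column.
def extract_entities(conll)
  entities = []
  current = nil
  conll.each_line do |line|
    word, _pos, _chunk, ner = line.split
    next if ner.nil?
    if ner == "O"
      current = nil
    elsif current && ner == "I-#{current[:type]}"
      current[:text] << " " << word        # continue the current span
    else
      current = { type: ner.sub(/^[IB]-/, ""), text: word }
      entities << current
    end
  end
  entities
end

sample = <<~CONLL
  U.N. NNP I-NP I-ORG
  official NN I-NP O
  Ekeus NNP I-NP I-PER
  heads VBZ I-VP O
  for IN I-PP O
  Baghdad NNP I-NP I-LOC
CONLL

extract_entities(sample)
# => three spans: ORG "U.N.", PER "Ekeus", LOC "Baghdad"
```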
2.9 Technologies
Following is an overview of the main technologies and tools used to implement
the framework.
JavaScript Object Notation JavaScript Object Notation, or JSON, is a lightweight
data-interchange format similar in purpose to XML. It is a text-based, human-readable
format for representing simple data structures and associative arrays (called objects).
It is based on the JavaScript syntax for describing data structures and supports the
data structures most commonly used in high-level languages. It was chosen over other
data exchange formats such as XML for its simplicity and readability. Its ease of
mapping to the data types provided by most languages makes it very natural to convert
back and forth, and it is also supported across a multitude of languages and
frameworks, with libraries implemented in every popular high-level language.
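The round trip between Ruby values and JSON, which made the format so convenient in this project, is a one-liner in each direction (the field names are invented):

```ruby
require "json"

# a query record as a plain Ruby hash
record  = { "query" => "thai restaurant in LA", "domains" => ["food", "tourism"] }
encoded = JSON.generate(record)  # serialize to a JSON string
decoded = JSON.parse(encoded)    # back to a Ruby hash, equal to the original
```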
CouchDB CouchDB is an Apache Foundation project for a document-based
database server written in Erlang, a highly efficient language for concurrent and
distributed applications. It diverges from the model of relational databases in many
ways and offers a very different performance profile. CouchDB stores free-form
documents instead of the records seen in a regular relational database. Its schemas
are flexible, and the elements can change from one document to another within the
same database. This can be useful in many applications, such as ones where schemas
are highly likely to change over time, or in situations where the rows are very
sparse, that is, where many fields exist but only a few are actually used in a single
document. The server is accessible via a RESTful JSON API. JSON is its native
data format, which makes it very flexible in terms of what data types can be stored.
It also supports computed views, which replace indices and are written in JavaScript
by the user. These views follow the Map/Reduce paradigm, where a first function
(map) is tasked with going over every document, emitting key/value pairs in which
both parts can be any given JSON element.
The second function (reduce) then sorts and groups elements by their keys, and
transforms and reduces the array of values associated with each key into a single
atomic element. The contract is that the computation of one element is totally
independent from the computation of any other, allowing the system to distribute
the work, cache it aggressively and reorder it as needed to improve performance.
CouchDB also supports keeping multiple revisions of a single document, allowing
the user to request a particular version. This also makes it possible to offer
optimistic conflict resolution for updates, where, during an update operation, the
sender is required to state which version its change is based on. If that version
corresponds to the most up-to-date one, the update is applied without any trouble.
Otherwise, if another user has already updated the same document, an error message
is sent to the user, who is then given the opportunity to rebase on the latest version.
Another interesting feature of CouchDB is its core support for master-master
replication, where two nodes can be synchronized and both can still act as master,
unlike the normal master-slave model where slaves are only used for read operations
while the master is the unique point of update. CouchDB was chosen within the
context of this project with the idea that the schema was most likely to change
greatly over the course of the research, and that the objects we would need to store
would not fit a relational database very well.
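The map/reduce contract of CouchDB views can be mimicked in a few lines of plain Ruby (a toy in-memory simulation, not CouchDB's actual JavaScript machinery):

```ruby
# map emits key/value pairs per document; reduce collapses the values
# grouped under each key into a single element.
def run_view(docs, map, reduce)
  emitted = Hash.new { |h, k| h[k] = [] }
  emit = ->(key, value) { emitted[key] << value }
  docs.each { |doc| map.call(doc, emit) }
  emitted.transform_values { |values| reduce.call(values) }
end

docs = [{ "domain" => "tourism" }, { "domain" => "food" }, { "domain" => "tourism" }]

# count documents per domain
counts = run_view(docs,
                  ->(doc, emit) { emit.call(doc["domain"], 1) },
                  ->(values) { values.sum })
# => {"tourism"=>2, "food"=>1}
```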
Ruby Ruby is a high-level programming language known for being highly dynamic
and flexible with regard to its syntax. Ruby supports multiple programming
paradigms, including functional, object-oriented, imperative and reflective. It also
has a dynamic type system and automatic memory management. While its
implementation is slower than that of other languages, it has become famous for
allowing the creation of DSLs (Domain-Specific Languages), where the host language
itself is adapted in order to create a more natural syntax suited to the task at hand.
In particular, it has become famous for its use in the Web domain, where it now
sports a host of libraries for quickly and efficiently creating web applications. It is
a pure object-oriented language, where every method or function is activated by
sending a message to the desired instance. Every element
in the code can be considered an object, even literal strings and numbers. In this it
follows the tradition of the Smalltalk family of languages. Ruby also allows the
re-opening and modification of already-defined classes, even those that are part of
the core and standard library.
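Open classes are easy to demonstrate; the following (contrived) snippet reopens the core Integer class and adds a method to it:

```ruby
# Reopening a core class: every Integer instance, including literals,
# gains the new method.
class Integer
  def squared
    self * self
  end
end

7.squared        # => 49
7.squared.class  # => Integer
```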
Sinatra Sinatra is a Domain-Specific Language (DSL) for quickly creating
web applications in Ruby. It is extremely simple while still keeping most of the
power of other frameworks, and this simplicity also offers great flexibility. It is not
a typical Model-View-Controller framework: it ties specific URLs directly to
relevant Ruby code and returns that code's output in response. It does enable you,
however, to write clean, properly organized applications, separating views from
application code, for instance. Any given operation can be performed within those
blocks of code, and the only contract is that they are expected to return a string of
characters that will be sent to the user. This string can be generated directly or,
preferably, created by rendering a specified template that abstracts away the view
part. Sinatra itself can be run on a number of application servers, ranging from
small, focused ones such as Thin to general web servers such as Apache.
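To illustrate the contract, here is a toy dispatcher in plain Ruby, not Sinatra itself: a path is tied directly to a block, and whatever string the block returns becomes the response body.

```ruby
ROUTES = {}

# tie a path directly to a block of Ruby code, Sinatra-style
def get(path, &handler)
  ROUTES[path] = handler
end

# look up the block for a path and use its return value as the body
def dispatch(path)
  handler = ROUTES[path]
  handler ? handler.call : "404 Not Found"
end

get("/hello") { "Hello, World" }

dispatch("/hello")    # => "Hello, World"
dispatch("/missing")  # => "404 Not Found"
```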
HTTParty While fetching and parsing data from an external web service can be
done using low-level libraries, HTTParty makes it much easier. One simply has to
specify the URL of the service as well as optional parameters, such as the developer
key in the case of Yahoo! Answers, and HTTParty takes care of fetching the data
and translating it into a native format of its host language, Ruby.
Scala Scala is a high-level language closely based on Java, but taking ideas from
functional languages, such as Haskell or ML, and others from dynamic languages
like Ruby. It was created in Switzerland in a project led by Martin Odersky, a lead
designer of the Java language itself. Scala offers quasi-total compatibility with
Java, being able to import and export libraries compiled in any language running
on the Java Virtual Machine. In addition, while it sports a Java-like syntax, it
supports type inference, allowing users to skip explicitly
defining the type of each variable. It also supports higher-order functions, pattern
matching and an evolution of interfaces and abstract classes called traits, inspired
by Ruby mixins. Among its remarkable features is a library that offers a new
perspective on concurrent systems, called actors. This feature, taken from languages
such as Erlang and Smalltalk, allows a developer to conceptualize systems as a
series of independent processes called actors, which can communicate through the
use of referentially transparent messages. Actors are implemented using a mailbox,
in effect a queue where messages are stored. The actor can then define its act
method to handle these messages, often using pattern matching to dispatch on the
type of the message, which can be arbitrary. Scala was primarily chosen because it
offers access to the wide library of Java applications. It was also chosen over Java
itself because it is more suited to exploratory programming, where one does not
know exactly the shape the result will take, as was the case at the beginning of this
project.
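Since the framework's other language is Ruby, the actor idea can be sketched there as well: each actor owns a mailbox queue and a handler that consumes messages one at a time (a deliberate simplification of Scala's actual actor library).

```ruby
class Actor
  def initialize(&act)
    @mailbox = Thread::Queue.new            # messages wait here
    Thread.new { loop { act.call(@mailbox.pop) } }
  end

  # sending is asynchronous: the message is only enqueued
  def send_message(msg)
    @mailbox << msg
  end
end

replies = Thread::Queue.new
doubler = Actor.new { |msg| replies << msg * 2 }
doubler.send_message(21)
replies.pop  # => 42 (pop blocks until the actor has replied)
```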
Kestrel Kestrel is a queuing service we use to distribute work tasks among the
workers, and to send them from the server to where the manager of the workers can
reach them. While it is quite new, it has proven its worth through use at Twitter Inc.,
where it powers much of that hugely popular communication service. The
particularity of this service is that it complies with the Memcached protocol.
Memcached is the most widely used service for storing transient data and is used as
a cache to avoid repeating costly operations. While Kestrel changes the semantics
of this protocol, the fact that it respects the simple get and set contract of
Memcached allows the use of a great number of libraries that, while originally
written for Memcached itself, can now be used transparently to send tasks to the
Kestrel server. Its basic semantics are that a set operation associates a key with a
queue, and the payload given within the operation is added to the end of that named
queue. The get operation instead takes the first element from that same named
queue, or returns a special message if no element can be found. Kestrel itself is
implemented as a daemon in Scala, a high-level language that takes most of its
inspiration from Java and is in fact compiled to Java bytecode, allowing it to run
seamlessly in the Java Virtual Machine.
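The get/set queue semantics just described can be captured by a tiny in-memory stand-in (the class below is an illustration, not a Kestrel client):

```ruby
class FakeKestrel
  EMPTY = :empty   # marker returned when a queue has no elements

  def initialize
    @queues = Hash.new { |h, k| h[k] = [] }
  end

  # `set` appends the payload to the end of the named queue
  def set(queue_name, payload)
    @queues[queue_name] << payload
  end

  # `get` takes the first element, or the marker if none is available
  def get(queue_name)
    @queues[queue_name].empty? ? EMPTY : @queues[queue_name].shift
  end
end

kestrel = FakeKestrel.new
kestrel.set("tasks", "analyze query 1")
kestrel.set("tasks", "analyze query 2")
kestrel.get("tasks")  # => "analyze query 1" (FIFO order)
```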
The Google Chart API To build the pie chart graphics in the statistics section
we used the Google Chart tool. This tool provides a free service to dynamically
render image charts through a simple URL request to a Google server. The URL
requests are simple to build and very useful for embedding graphical elements in a
web page.
Many chart types are available. The chart type is specified by the cht parameter
and the data by the chd parameter. It is then possible to set the format to use for
the data (such as simple text format, or one of the encoding types) and to specify
the chart size with the chs parameter. Additional parameters can be added; each
chart's documentation lists the available ones, which include labels, titles, and
colors.
A sample URL starts with http://chart.apis.google.com/chart? and is followed by
all required and optional parameters. Example:
http://chart.apis.google.com/chart?chs=250x100&chd=t:60,40&cht=p3&chl=Hello|World
Figure 2.3: The sample pie chart
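Assembling such a URL from its parameters is a matter of string building; the Ruby sketch below reproduces the sample URL above:

```ruby
BASE = "http://chart.apis.google.com/chart"

# join parameter/value pairs into the chart request URL
def chart_url(params)
  "#{BASE}?#{params.map { |k, v| "#{k}=#{v}" }.join("&")}"
end

url = chart_url(chs: "250x100",       # chart size
                chd: "t:60,40",       # data in simple text format
                cht: "p3",            # 3D pie chart
                chl: "Hello|World")   # slice labels
# => "http://chart.apis.google.com/chart?chs=250x100&chd=t:60,40&cht=p3&chl=Hello|World"
```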
Chapter 3
Related Work
A great deal of work has been done in the past in the information retrieval field
using the tools presented in the previous chapter. We explain some of these
approaches in the following sections.
3.1 WordNet
WordNet has been used, since its creation in the '90s, in numerous natural language
processing tasks, such as part-of-speech tagging, word sense disambiguation, text
categorization and information extraction, with considerable success. The usefulness
of WordNet in information retrieval applications, however, has been controversial.
Information retrieval is the process of locating documents relevant to a user's
information needs in a collection of different sources. The user describes his or her
information needs with a query consisting of a number of words. The information
retrieval system compares the query with the documents in the collection and
returns the ones that are likely to satisfy the user's information requirements. The
main weakness of this process is that the vocabulary searchers use is often not the
same as the one with which the information has been indexed. One method to
address this problem is query expansion [4]. The queries are expanded with terms
that have similar meaning or bear some relation to those in the query, increasing
the chances of matching words in relevant documents. Expanded terms are generally
taken from a thesaurus. Even with query expansion methods, however, no fully
satisfactory results were achieved, mainly because of some practical limitations of
WordNet as a tool:
• Two terms that are clearly interrelated may have different parts of speech in
WordNet. This is the case for stochastic (adjective) and statistic (noun).
Since words in WordNet are grouped on the basis of part of speech, it is not
possible to find a relationship between terms with different parts of speech.
• Many relationships between two terms are simply not found in WordNet. For
example, how do we know that Mizuho Bank is a Japanese company?
• Some terms are not included in WordNet at all (proper names, locations, etc.).
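The expansion idea itself is simple; here is a toy version with an invented two-entry thesaurus:

```ruby
# each query term is expanded with related terms before matching
THESAURUS = {
  "film"  => ["movie", "picture"],
  "hotel" => ["inn", "lodging"]
}.freeze

def expand(query)
  query.split.flat_map { |term| [term] + THESAURUS.fetch(term, []) }
end

expand("hotel near film theater")
# => ["hotel", "inn", "lodging", "near", "film", "movie", "picture", "theater"]
```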
3.2 WordNet Domains
This tool has been used mainly in the field of word sense disambiguation. The
underlying hypothesis is that domain labels, such as Medicine, Architecture and
Sport, provide a useful way to establish semantic relations among word senses,
which can be profitably used during the disambiguation process. One of the first
approaches to word domain disambiguation through WordNet Domains was [5],
where words in a text are tagged with a domain label in place of a sense label taken
from the classic WordNet dictionary. They adopted frequency measures, based
respectively on the intra-text frequency and the intra-word frequency of a domain
label.
In [6], Domain Relevance Estimation (DRE) is presented. Given a certain domain,
DRE distinguishes between relevant and non-relevant texts by means of a Gaussian
Mixture model that describes the frequency distribution of domain words inside a
large-scale corpus; DRE is a fully unsupervised text categorization
technique. The correct identification of the domain of a text is a crucial point for
Domain Driven Disambiguation. Studies on the relevance of a text in the domain
context have been exploited by approaches like [7], where an approach based on
word sense disambiguation is presented. Using WordNet Domains and retrieving
the domains available for each synset of a word, it is possible, through different
approaches using distance vectors, to calculate the most representative domain. It
is in fact assumed that domains constitute a fundamental semantic property on
which textual coherence is based, such that word senses occurring in a coherent
portion of text tend to maximize their belonging to the same domain.
All these approaches achieve good results but use only domain information. To
improve the system's recall, other information should be integrated into the
domain-based approach, for example supervised approaches that make use of local
information, such as word collocations and grammatical context.
3.3 Named Entity Recognizer
A lot of work has been done in the past using entity recognizers as a kind of
intelligent parser. They can recognize named entities more precisely than, for
instance, regular expressions, which can only identify a proper name but cannot
classify its meaning. NER systems are thus commonly used as information
extractors on formal text such as news articles and websites, as geographical
information extractors [9], or as personal information extractors from emails [8].
Their information retrieval function is very specialized with respect to the kind of
material under examination: blogs, news and emails have different structures and
kinds of data that require differently characterized NER systems to return good
results. Moreover, many of these recognizers are computationally heavy because of
the large sets of training data they have to handle. The optimal solution to obtain
the greatest number of entities from a document would be to combine an
environment-optimized NER system with a set of regular expressions.
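Such a combination can be sketched as follows; the gazetteer standing in for a real NER system and the date pattern are, of course, invented for the example:

```ruby
# stand-in for a statistical NER system: a tiny gazetteer lookup
def stub_ner(text)
  known = { "Baghdad" => "LOCATION", "Ekeus" => "PERSON" }
  text.split.map { |w| [w, known[w]] if known[w] }.compact
end

# regular expressions catch rigid patterns a recognizer may miss
def regex_entities(text)
  text.scan(/\d{4}-\d{2}-\d{2}/).map { |date| [date, "DATE"] }
end

text = "Ekeus heads for Baghdad on 1997-08-02"
stub_ner(text) + regex_entities(text)
# => [["Ekeus", "PERSON"], ["Baghdad", "LOCATION"], ["1997-08-02", "DATE"]]
```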
3.4 Query splitting
The topic of query splitting, or query segmentation, has been analyzed in many
papers, and very different approaches have been tested. The one examined in [11]
is based on retrieved results. The aim of this approach is to find interesting
documents that link two queries, functioning as “stepping stones”. This way of
proceeding is particularly useful in the field of academic and scientific articles.
The two queries can be provided by the user himself, or they can be identified by
the system through the examination of the single query provided; this is done with
an unsupervised method that analyzes the various documents retrieved for the
query and groups them according to common terms and characteristics.
In [12] an unsupervised approach is proposed, based on a query word-frequency
matrix derived from web statistics. They first adopt the N-Gram model to estimate
the query terms' frequency matrix based on word occurrence statistics on the web.
They then devise a strategy to select the principal eigenvectors of the matrix.
Finally, they calculate the similarity of query words for segmentation.
In [13] a generative query model is used to recover the query's underlying concepts
that compose its original segmented form. The model's parameters are estimated
using an expectation-maximization (EM) algorithm, optimizing the minimum
description length objective function on a partial corpus that is specific to the query.
To augment this unsupervised learning, they incorporate evidence from Wikipedia,
exploiting external knowledge to make sure that the output segments are well-formed
concepts, not just frequent patterns.
Most effective approaches to query splitting use unsupervised methods. There are
some natural-language-based query analysis research efforts, but they are often
very structured or restricted to specific domains, acting more like natural language
interfaces to databases than natural language analyzers.
3.5 Matching
The query matching subject has not been approached widely, but a significant
piece of research can be found in [14], where a generic query is routed to a proper
search service after an analysis by the automated query routing system Q-Pilot.
Off-line, Q-Pilot takes as input a set of search engines' URLs and creates, for each
engine, an approximate textual model of that engine's content or scope, something
conceptually similar to SeCo's semantic annotation for Service Marts. On-line,
Q-Pilot takes a user query as input, applies a query expansion technique to the
query and then clusters the output of query expansion to suggest multiple topics
that the user may be interested in investigating. Each topic is associated with a set
of search engines, for the query to be routed to, and a phrase that characterizes the
topic. For example, for the query “Python”, Q-Pilot enables the user to choose
between movie-related search engines under the heading “movie — monty python”
and software-oriented resources under the headings “object-oriented programming
in python” and “jpython — python in Java”. An important key point in the Q-Pilot
design is the use of neighborhood-based identification of search engines' topics in
combination with query expansion. This approach gives quite good results, as
reported in the article; query expansion fills the gap between the short query and
the small number of terms in search engines' topics. This system, though quite
efficient, is well suited only to very short, single-domain queries.
Chapter 4
The Thesis Project Contribution
4.1 Objective
Complex queries make it possible to extract answers from complex data, rather
than from within a single Web page; but complex data require a data integration
process. In the SeCo project this process is query-specific, because answering
queries about very different topics requires intrinsically different data sources.
However, data integration is one of the hardest problems in computing, because it
requires full understanding of the semantics of data sources; as such, it cannot be
done without human intervention. A data source is any data collection accessible
on the Web. The Search Computing motto is that each data source should be
focused on its single domain of expertise (e.g., travel, music, shows, food, movies,
health, genetic diseases), but pairs of data sources which share information can be
linked to each other to build complex results. This classification of the data into
different domain groups, represented by the service marts, is the basis for the
upper-level query elaboration that tries to match the input with the available data
sources.
In fact, the main objective of the thesis project is to enhance the existing natural
language analyzer framework and add a service mart matching function to match
the high-level query to the services for a future service invocation. In particular,
we aim to map user-specified queries with no fixed input forms onto the SeCo
multi-domain paradigm; this new feature will make it possible to test the output of
the natural language elaboration in the framework.
4.2 Hypothesis
Syntactical hypothesis: Clause-level splitting divides the query into clauses, as
the name says; therefore, to achieve a good splitting result, each clause should be
dedicated to a well-defined section of the request. By this we mean that the part of
a question regarding a specific field or domain should be confined to a single clause,
so that it can be satisfied by a web service, or by a group of semantically related
services.
Wrong example: “Is there a restaurant in Los Angeles in the proximity of a musical
theater? I'd prefer Thai food”
Figure 4.1: The trees retrieved from the analysis of the query
This request is formally wrong, considering our syntactical clause division, because
the last clause “I'd prefer Thai food” would be completely unrelated to the first one,
where other requests about the same domain are made. As visible in the
8/2/2019 Matching Natural Language Multi Domain Queries to Search Service
http://slidepdf.com/reader/full/matching-natural-language-multi-domain-queries-to-search-service 38/96
CHAPTER 4. THE THESIS PROJECT CONTRIBUTION 36
figure above two trees are retrieved because the parser considers the question fin-
ished when it finds the question mark symbol “?” and therefore it begins a new
tree for the following sentence
Correct Example: “Is there a Thai food restaurant in Los Angeles which is in the
proximity of a musical theatre?”
Figure 4.2: The tree for the correct example
Another downside of this kind of splitting is the misinterpretation of requests
linked by conjunctions. Because of the nature of the parser, only clauses (relative
clauses, subordinate clauses, or sentences) are recognized, so a request given as a list
of conditions linked only by conjunctions and no verbs is identified as a single
long sentence.
Wrong example: “I’d like a Thai Restaurant in LA, near a movie theater that
shows a horror movie and a hotel to spend the night with a spa.”
This example will be parsed into “I’d like a Thai Restaurant in LA, near a movie
theater that”, “shows” and “a horror movie and a hotel to spend the night with
a spa”. This splitting would be completely wrong in terms of domain division,
because the information on the movie should be linked to the previous part and
the information about the hotel should be placed in an independent sentence. The right
structure for this request would be:
“I’d like a Thai Restaurant in LA. A nearby movie theater that shows a horror movie. I
would like to spend the night in a hotel nearby.”
An easy way to overcome these splitting problems is to put a full stop after any
part of the request that concerns a specific subject. This limits the spreading
of the concepts in the sentence and keeps them all in a single sub-query. The
main problem with this approach is the loss of connection information among the
requests, which could otherwise be retrieved by deeply analyzing the links between the clauses.
Example: “I’d like a Hotel at the Bahamas where I can go snorkeling.”
The conjunction “where” indicates that the user wants to find the activity requested
in the second sentence in the place named in the first one. If we split the
sentence and transform the second part into an independent one, as in:
“I’d like a Hotel at the Bahamas. I want to go snorkeling there.”
the recognition of the link between the two sentences is no longer immediate
and requires a deeper analysis to be detected.
Service Marts’ Hypothesis: The complex structure of service marts explained
in Chapter 2 has many characteristics that do not specifically concern the
query/service mart matching function. We therefore decided to simplify the model to a
“semantic” version of it.
Below is the UML diagram used to design the objects used in the thesis application:
Figure 4.3: The semantic modelization of the Service Mart
As you can see, we only kept the elements with a semantic value, so we omitted the
representation of repeating groups, because they only carry a structural value.
We also hypothesized that we had only semantic attributes and no quality indicators
such as ranking. This feature is not yet retrievable from the queries through the
framework we built, so we assumed that for now it could be treated automatically
(as an intrinsic property of the order of the results) or parametrically (the user is given
the opportunity to decide about it).
Another feature that has not been considered is the join relation between different
service marts or access patterns, even though it is a very important one in the SeCo
architecture. The possibility to link different search services with join paths gives
the power to answer the greatest part of multi-domain queries; in fact, it is assumed
that a user will not repeat every bit of data in the request as many times as
the splitting into domains would require. The repetition of the linking parameter through
a join path is vital in these cases. With our implementation we can match only a
smaller range of multi-domain queries to service marts.
A third simplification concerns Service Attributes, which are presented as the semantic
version of Service Interfaces.
4.3 Query Analysis
From the corpus of entries retrieved, we proceeded to create techniques
that allow us to recognize and extract the domains from a question. This procedure
spanned several phases, each requiring different steps: the elaboration of
diverse strategies, their application to the corpus, and the evaluation of the
results. All the entries inserted in the Sift database are processed in two main steps:
• the parsing
• the splitting
4.3.1 The parsing
This step transforms the linear structure of the sentence into a tree representation of its
grammatical elements. This is the starting point for the splitting into multiple
domains. The tool used in this phase is the Stanford Natural Language Parser,
a Java-based library that, given a corpus of data trained for the English language,
produces a parse tree in which each atom is annotated with its role and the
different structures corresponding to the parts of the sentence (object, verb,
complement) are grouped into the tree.
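For illustration only, such a parse tree can be represented as nested tuples; this is a hand-built, hypothetical stand-in for the Stanford parser's output (the tree, its labels and the helper are ours, not part of the thesis framework):

```python
# A hand-built stand-in for a Stanford-style parse tree: each node is
# (label, child, child, ...); a leaf is a (POS-tag, word) pair.
# Hypothetical tree for "I'd like a Thai restaurant in LA".
tree = ("ROOT",
        ("S",
         ("NP", ("PRP", "I")),
         ("VP", ("MD", "'d"),
                ("VP", ("VB", "like"),
                       ("NP", ("DT", "a"), ("JJ", "Thai"), ("NN", "restaurant")),
                       ("PP", ("IN", "in"), ("NP", ("NNP", "LA")))))))

def leaves(node):
    """Collect (tag, word) pairs left to right."""
    if len(node) == 2 and isinstance(node[1], str):
        return [node]
    out = []
    for child in node[1:]:
        out.extend(leaves(child))
    return out

# nouns and verbs are the "atoms" later passed to the domain lookup
keywords = [w for t, w in leaves(tree) if t.startswith(("NN", "VB"))]
print(keywords)  # -> ['like', 'restaurant', 'LA']
```

This same tuple encoding is convenient for experimenting with the splitting strategies described next.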
4.3.2 The splitting
This step divides the input entry into multiple parts where each part should cor-
respond to a single domain that will be searched. The precision of the structure
obtained in the first part becomes very important, as it is by exploiting that structure
and its properties that we find opportunities to split an entry into
different parts. Many techniques were considered, but two main ones were retained
and tested on the input:
• First Level Split: This split is done directly at the first level of the sentence,
on the assumption that the upper levels of the tree will be divided into different
sub-sentences, each corresponding to a single, unique domain.
• Clause Level Split: This split follows directly the parser’s recognition of
clauses, either subordinate or relative.
While splitting a sentence into different parts is an important step towards getting
the final domains out of an entry, it still returns a response that is too coarse.
We must thus reduce each part to its simplest objects, the ones that characterize
the domain in which we find ourselves. Therefore, from each part, we keep only the nouns,
and the verbs if they correspond to meaningful actions (e.g. drive, rent, cure).
The clauses recognized by the parser and filtered by the splitter are: plain
sentences, relative clauses, subordinate clauses, and interrogative clauses (questions).
Keeping only nouns and verbs during the splitting causes a loss of information
about the qualities expressed in the request, because adverbs and adjectives are
discarded. This choice was made because only nouns and verbs are annotated
with domains in the WordNet Domains database. The lack of quality-defining
expressions at this stage of the process has to be compensated for in the following
steps of the query analysis.
4.4 The Extraction of the Data Types
The third step in the query analysis process is the extraction of the different types of
elements in the query. The typical form that a user fills in when using a
web service usually requires data in many different formats: dates,
prices, names and titles are only a few of them.
To efficiently analyze a natural language query, the system has to recognize and
identify the greatest possible number of input parameters. If a parameter can be
labeled with its format, the probability of a good match between the query and
the service attributes will be higher. For this reason we decided to use a
Named Entity Recognizer (NER) to extract named entities from the queries. We considered
different NERs and finally chose the one implemented by the Stanford group,
because it is completely compatible with the parser libraries already in use.
This NER can recognize entities of three kinds: Persons, Locations and Organizations.
These “proper noun” words are recognized by means of a large training set that
functions as a great database of information. There are many other NERs that
perform in a similar way and can recognize more entity types, such as numbers
and dates; we chose this particular NER firstly because of its compatibility with
the project already developed, and secondly because we believe that a more efficient
recognition of “standardizable types” can be achieved with regular expressions.
This led us toward a simpler NER rather than a complex, multifunction one.
We call “standardizable types” all the data types that follow a standard pattern
in their expression. Prices, for instance, are always numbers followed or preceded
by the symbol or the name of a currency; titles, if written correctly, are delimited
by double quotes (“I’m a Title”); distances have the same characteristics as prices,
with a unit of measurement, and so on. We chose regular expressions because the
ability to change the expressions in our program freed us from being dependent on a NER.
NERs can be very useful for entity recognition on “natural” words, but the machine
learning algorithms and big training sets they rely on are not easy to handle and
may not be powerful enough for standardizable types.
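As an illustration of this approach, the patterns below sketch how a few standardizable types could be captured with regular expressions (the type names and the exact patterns are our assumptions for illustration, not the thesis implementation):

```python
import re

# Illustrative patterns for "standardizable types"; real patterns would
# need to cover many more currencies, units and quoting styles.
PATTERNS = {
    "Price":    re.compile(r"[$€£]\s?\d+(?:\.\d{2})?|\d+(?:\.\d{2})?\s?(?:dollars|euros|USD|EUR)"),
    "Distance": re.compile(r"\d+(?:\.\d+)?\s?(?:km|miles|meters|mi)\b"),
    "Title":    re.compile(r'“[^”]+”|"[^"]+"'),   # double-quoted spans
}

def extract_types(text):
    """Return every labeled match found in the query text."""
    found = {}
    for label, rx in PATTERNS.items():
        hits = [m.group(0) for m in rx.finditer(text)]
        if hits:
            found[label] = hits
    return found

query = 'A hotel under $100 within 5 km that screens "Casablanca"'
print(extract_types(query))
```

Because the patterns live in the program itself, swapping a currency symbol or a unit is a one-line change, which is exactly the independence from the NER argued above.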
4.5 Mapping to domains
From the basic objects identified in the clauses, nouns and verbs, we use another
set of tools and techniques to extract domains out of them. Subsequently, in the
SeCo application, domains can be mapped to a web service. In order to do this, we
focused on the tools provided by the WordNet project, and especially the add-on
of WordNet Domains. The approach used is to parse the dictionary of WordNet,
which is organized into words, each relating to one or more synonym sets, or senses
of a word, also called synsets. Each synset has a unique identifier consisting
of its offset within the WordNet database. We use this identifier to connect a
synset to its associated domains within the WordNet Domains database, where
the key is the synset offset and the values are one or more domains. Fig. 4.4 shows the
relationships we follow to get the domains.
Figure 4.4: The WordNet Domains Hierarchy
The domain retrieval process can result in a large number of domains from a single
word. The perfect approach (from the human point of view) would be
to identify, for every word, which sense it refers to in the given
sentence, and retrieve the domains accordingly. Given the inner difficulty of this
task, in order to get the most relevant domains we use the tf-idf [10] information
retrieval technique. This is a sorting mechanism that calculates the importance of
a single domain by its relative presence in a single word, over how common it is
across all the domains we retrieved from the objects of the sub-entry. A second
technique that was evaluated is to retrieve the domain relationship directly from
WordNet, which can give a word sense a relationship to another word of which
it is the topic. WordNet is organized as an index of words to their possible senses,
and a database containing details about such senses. In particular, information
about the relationships between the current sense and others is kept in that database.
There are many kinds of relationships, such as is-a, is-part-of or, as we wish to extract,
is-member-of-this-domain. These relationships allow one to go from sense,
or synset, to sense, forming a graph spanning the whole database. The
approach using WordNet topics (i.e. the is-member-of-this-domain relationship)
was discarded because of the scarcity of data in the database; in fact, such a
relationship was available only for very few words.
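One plausible reading of this tf-idf scoring can be sketched as follows; the word-to-domain mapping is invented for illustration, and the exact weighting used in the thesis may differ:

```python
import math
from collections import Counter

# Hypothetical word -> domain lists, one entry per WordNet synset of the
# word (so a domain repeats when several senses share it).
word_domains = {
    "restaurant": ["gastronomy", "gastronomy", "commerce"],
    "thai":       ["gastronomy", "linguistics", "geography"],
    "theater":    ["theatre", "architecture"],
}

def tfidf_scores(word_domains):
    """Score each domain: tf = relative presence among a word's senses,
    idf = how rare the domain is across the sub-entry's words."""
    n_words = len(word_domains)
    df = Counter()                        # in how many words does a domain occur?
    for doms in word_domains.values():
        df.update(set(doms))
    scores = Counter()
    for doms in word_domains.values():
        for d, c in Counter(doms).items():
            tf = c / len(doms)
            idf = math.log(n_words / df[d]) + 1.0   # +1 keeps shared domains visible
            scores[d] += tf * idf
    return scores.most_common()

print(tfidf_scores(word_domains)[0])      # highest-scoring domain first
```

With these toy numbers, "gastronomy" comes out on top because it is strongly present in two of the three words of the sub-entry.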
To retrieve as much information as possible from the queries, another domain
retrieval method was implemented, which exploits the Named Entity Recognizer.
The extraction of the domains from the WordNet Domains database involves only
nouns and verbs, since only those are retrieved by the parser. With the use of the
NER we can recognize some entities, as explained before, and then use their
category label to retrieve the domains. This allows us not to lose much information
due to the presence of proper nouns.
Example: “I want to go to Los Angeles by plane from Milan and find a hotel near
George Clooney’s house”
Entities recognized:
• Locations: Los Angeles/Milan
• Organizations: none
• Persons: George Clooney
Words from which the domains will be retrieved: want, go, Location, plane, Location, find, hotel, Person, house
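The substitution step shown by the example can be sketched as follows (the entities are hard-coded here for illustration instead of coming from the Stanford NER):

```python
# Replace recognized named entities with their category label before the
# WordNet Domains lookup; in the real pipeline these pairs come from the NER.
entities = {"Los Angeles": "Location", "Milan": "Location",
            "George Clooney": "Person"}

def substitute_entities(text, entities):
    """Rewrite each entity surface form as its category label."""
    for surface, label in entities.items():
        text = text.replace(surface, label)
    return text

q = ("I want to go to Los Angeles by plane from Milan and find a hotel "
     "near George Clooney's house")
print(substitute_entities(q, entities))
```

The rewritten sentence contains only dictionary words and category labels, so every token can now be looked up in WordNet Domains.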
4.5.1 Methods to improve the domain score
The tf-idf method of ordering domains for each sub-entry gives quite good results,
but we believe it can be improved with other methods that alter and
tweak the scores. We propose three methods:
• Most frequent couples: we calculate the frequency of the couples of domains
among the queries already examined and assign a bonus to the most
frequent couples. Using this criterion, “distant” or very different domains can
be assigned a higher score on the basis that they have been found together in
a query many times.
• Most frequent couples in Service Marts: the same approach can be applied
to the scoring of the Service Mart domains. An offline analysis can be
done and a bonus can be assigned to the most frequent couples of domains
among the Service Mart annotations.
• Nearest domains: a bonus is assigned to the domains in the query that are
close to each other according to the distance on the WordNet Domains tree.
Only the third method was actually implemented, for several reasons, first and
foremost the absence of a testable, reliable and sufficiently large database of
queries and service marts.
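A minimal sketch of this nearest-domains bonus, assuming a parent-pointer encoding of a small, invented slice of the domain hierarchy:

```python
# A tiny, invented slice of a domain hierarchy, as parent pointers
# (the actual WordNet Domains tree is much larger).
parent = {
    "gastronomy": "applied_science",
    "food": "applied_science",
    "applied_science": "root",
    "theatre": "art",
    "art": "root",
}

def path_to_root(d):
    path = [d]
    while d in parent:
        d = parent[d]
        path.append(d)
    return path

def tree_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors = set(pa)
    for steps_b, node in enumerate(pb):
        if node in ancestors:
            return pa.index(node) + steps_b
    return None   # disjoint trees

def apply_nearness_bonus(scores, max_dist=2, bonus=0.5):
    """Boost every pair of extracted domains that sit close in the tree."""
    boosted = dict(scores)
    doms = list(scores)
    for i, a in enumerate(doms):
        for b in doms[i + 1:]:
            d = tree_distance(a, b)
            if d is not None and d <= max_dist:
                boosted[a] += bonus
                boosted[b] += bonus
    return boosted

print(apply_nearness_bonus({"gastronomy": 1.0, "food": 0.8, "theatre": 0.9}))
```

Here "gastronomy" and "food" share a parent (distance 2), so both get the bonus, while "theatre" is too far away and keeps its score; the threshold and bonus values are illustrative.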
Further research along this line could involve more complex data mining
methods and approaches. The biggest problem in this domain scoring approach
is the small number of domains: even if we could achieve a perfectly
ordered list of domains for each sub-entry, its meaning would be very poor with
respect to the possible annotations used on the service mart side. Another possible
approach to this problem would be to somehow use the retrieved domains to find
matching synsets that could be useful in the subsequent matching process.
4.6 The Service Mart Repository
The SeCo project is still a work in progress and the registration of service
marts is not active yet. Therefore, to efficiently test our query analysis and matching
processes, we decided to create a list of fictitious service marts with characteristics
and parameters very close to the real ones, using as model and inspiration
the ones presented in the YQL database. The semantic value of these service marts spans
a great number of domains, and they are complete with data type descriptions
and multiple access patterns. We thus populated a repository with approximately
70 service marts that we used in our experiments.
4.7 Map sub-queries to Service Marts
To integrate the Sift application with the main SeCo project, a “Query to Service
Mart” mapping is needed. This mapping is based on the domains associated with
each of the keywords found in the query and on the corresponding semantic
definition of the Service Mart. From the available list of service marts we extract, for
each sub-query, a list of suitable ones according to their semantic annotation. The
scoring system used in the matching to order the retrieved service marts is based
on the individual score each matched domain received in the previous
calculations.
This is a quite simple approach to the matching problem, but nonetheless quite
effective, thanks to the good scoring system for the domains used previously.
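The mart-ranking step can be sketched as follows (mart names, annotations and scores are invented for illustration):

```python
# Score each service mart by summing the scores of the query domains that
# appear in its semantic annotation; annotations are made-up examples.
marts = {
    "RestaurantFinder": {"gastronomy", "tourism"},
    "TheaterShows":     {"theatre", "art"},
    "HotelBooking":     {"tourism", "commerce"},
}

def rank_marts(domain_scores, marts):
    """Return (mart, score) pairs, best match first; unmatched marts drop out."""
    ranked = []
    for name, annotation in marts.items():
        score = sum(s for d, s in domain_scores.items() if d in annotation)
        if score > 0:
            ranked.append((name, score))
    return sorted(ranked, key=lambda p: p[1], reverse=True)

sub_query_domains = {"gastronomy": 1.4, "commerce": 0.7}
print(rank_marts(sub_query_domains, marts))
```

Because the mart score is just the sum of the matched domain scores, the quality of the ranking inherits directly from the quality of the domain scoring discussed earlier.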
A further development of this approach would be to also consider other semantic
annotations that could be used for service marts, such as synsets. To support this
kind of annotation we would have to extract domains from those synsets and then
choose the right service marts according to their new semantic description.
4.8 Map sub-queries to Access Patterns
Once all the sub-queries are matched with a list of potentially compatible service
marts, we have to specifically match the data we previously retrieved to the attributes
in each Access Pattern. Access Patterns are a sub-entity of the service
marts and contain all the service definitions that ultimately have to be connected
to the parameters the user provides in input. The ultimate goal of this task is
to match each of the identified sub-queries to one or more services, handled by
the Access Patterns, so that the subsequent processing of the query can take place
according to a well-defined query execution strategy. Every single data entity required
by the services has to have a counterpart in the user request so that the
service can be invoked and give back a result.
Figure 4.5: Example of sub-query/AP matching
Every Access Pattern is composed of a number of service attributes that define
the searching capabilities of the mart. These service attributes are annotated
semantically with domains and synsets. We also hypothesize that the service provider
will indicate, for every attribute, a data type chosen from the enumeration we defined
in the UML diagram of Fig. 4.3. With these annotations defined, we can then match
every parameter, starting from the mandatory ones, to the available access patterns.
The matching is done respecting the order of the data both in the Extracted
Data structure and in the Access Pattern one. We assume that this is an advantage
for temporal and spatial parameters, which are usually placed in a certain order:
Start->Destination, DateOfDeparture->DateOfReturn.
The matching process is not a trivial one, since not all the parameters can be
unambiguously matched. For instance, an imprecision in the analysis can miss the name
of a location or an organization; when these parameters are then requested
by the Service Attributes, no match will be found. For this reason we decided
not to develop a single static matching but a dynamic one, which checks more than
one data type for each request, according to the nature of the data requested. For
instance, if no word labeled with the type Title is available in the sub-query
examined, then a simple expression labeled “Word” will match. The same was
done with all the numeric types, such as dates, prices and times.
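This dynamic fallback matching can be sketched as follows (the fallback chains, type names and attribute names are assumptions for illustration):

```python
# Fallback chains: if no item of the exact type is left in the sub-query,
# try progressively more generic ones (chains are illustrative).
FALLBACKS = {"Title": ["Title", "Word"], "Location": ["Location", "Word"],
             "Date": ["Date", "Number"], "Price": ["Price", "Number"]}

def match_attributes(attributes, extracted):
    """Greedy in-order matching of access-pattern attribute types to
    extracted (type, value) items; each item is consumed at most once,
    and order preservation covers cases like Start->Destination."""
    pool = list(extracted)
    result = {}
    for attr_name, attr_type in attributes:
        for wanted in FALLBACKS.get(attr_type, [attr_type]):
            hit = next((item for item in pool if item[0] == wanted), None)
            if hit:
                pool.remove(hit)
                result[attr_name] = hit[1]
                break
    return result

attrs = [("departure", "Location"), ("destination", "Location"), ("film", "Title")]
data = [("Location", "Milan"), ("Location", "Los Angeles"), ("Word", "horror")]
print(match_attributes(attrs, data))
```

Here the two Locations are assigned in order (departure before destination), and the Title attribute, finding no quoted title, falls back to the plain word "horror".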
4.8.1 The semantic name matching
A different analysis was done on named entities. These are a special kind of entity
because, unlike the others, they have a semantic annotation that we
retrieved from WordNet and WordNet Domains. These annotations can be useful
when more than one name parameter is needed. Through the calculation of a
matching score very similar to the one used for the sub-query/service
mart matching, we sort the eligible names by highest compatibility and choose the
most suitable one.
4.8.2 Evaluation Criteria and Statistics
Some improvements to the original Sift application were required to efficiently
filter the queries among the raw set we acquired from our source, Yahoo! Answers.
The Yahoo! Answers input structure requires users to type a “title” and a
proper question in the form; the Sift application only acquires the “title” of the
question, since the complete text of the question is often too long and filled with
other objects, such as links, that are not useful for our analysis.
Due to this choice, a lot of filtering has to be done to eliminate incomplete queries or
inconclusive “titles”.
The preprocessing
During the preprocessing of the queries a basic filtering is applied manually, and
entries are deleted if no question is asked or if there are grammatical or spelling
errors in the keywords of the phrase. As an improvement to this phase, a correction
form has been added to the application for every entry retrieved; it can be used to
correct and update sentences with typos, spelling errors and abbreviations without
having to eliminate them. The option to eliminate one or more entries altogether
has also been added.
The query evaluation criteria
Since all the queries we acquire have to be evaluated manually (by a human being)
to separate the multi-domain ones, optimal for our purposes, from the single-domain
or ill-defined ones, a general evaluation criterion is required.
The evaluation, originally based on a 5-star rating, has been changed to a 3-star
rating, and a score from 1 to 3 must be assigned to every query during the process.
We define here an evaluation criterion to standardize this phase and make it possible
for everyone who is evaluating the queries to do it within some guidelines.
• 1 Star - Single domain query (Not useful for our purposes)
• 2 Stars - Ambiguous multi-domain query (a multi-domain query not suitable
for our research, e.g. “What is the best birthday gift for my wife?”)
• 3 Stars - Multi-domain query
Statistics
Once the entries are processed, a screen split into six sections is presented to the user.
These sections illustrate, respectively, the results of: the first-level split, the clause split,
the domain extraction based on the clause split and its optimized version,
and the service mart matching retrieved. These aspects of the processing can be
evaluated manually with a 3-star rating. To give a more user-friendly presentation of
the results, some statistics graphs have been added to the Sift application.
First-level split
The first-level split is the task that takes the considered entry and splits it based on
the first-level division found in the Stanford parse tree. It finds the first internal
node that has more than one child, and takes each child as a different section.
Then, in each section subtree, we look for the interesting elements, noun and verb
atoms, and take them as the objects.
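The first-level split described above can be sketched, under our own simplified tuple encoding of a parse tree (an illustrative stand-in, not the thesis code), as:

```python
# Toy tree: each node is (label, child, ...); leaves are (POS-tag, word).
tree = ("ROOT",
        ("S",
         ("S", ("NP", ("NN", "restaurant")), ("VP", ("VB", "eat"))),
         (",", ","),
         ("S", ("NP", ("NN", "hotel")), ("VP", ("VB", "sleep")))))

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def first_level_split(node):
    """Descend through single-child nodes until one with more than one
    child is found; each of its children becomes a section."""
    while not is_leaf(node) and len(node) == 2:
        node = node[1]
    return list(node[1:]) if not is_leaf(node) else [node]

def leaves(node):
    if is_leaf(node):
        return [node]
    out = []
    for c in node[1:]:
        out.extend(leaves(c))
    return out

sections = first_level_split(tree)
# keep only noun and verb atoms in each section
keywords = [[w for t, w in leaves(s) if t.startswith(("NN", "VB"))]
            for s in sections]
print(keywords)
```

Note how the comma forms an empty section once the noun/verb filter is applied; this kind of noise is part of why the method scored lower than the clause split in the evaluation below.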
The evaluation criteria are defined as:
• 1 Star - Completely inadequate splitting considering the domains of the
entry
• 2 Stars - Lack of precision in the splitting or in the extraction of the keywords
• 3 Stars - Precise division of domains and keywords
The statistics count the star ratings, so that it is possible to evaluate the split method
over a great number of entries. A statistic on the number of keywords extracted
has also been implemented.
Figure 4.6: Example statistics for the first-level split
Clause split
In this splitting method the tree is visited in a depth-first, left-to-right manner,
buffering elements in a domain until a new clause is encountered. From each
buffered part we then take all the leaves, filter out the ones that are neither
nouns nor verbs, and return the rest as the resulting objects.
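This depth-first, buffer-until-clause visit can be sketched as follows (an illustrative simplification: only SBAR/SBARQ nodes open a new buffer here, and the tuple tree encoding is our own, not the thesis code):

```python
CLAUSE_LABELS = {"SBAR", "SBARQ"}   # assumed clause labels for the sketch

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def clause_split(node, buffers):
    """DFS left to right; opening a clause node starts a new buffer,
    and every leaf lands in the most recent buffer."""
    if is_leaf(node):
        buffers[-1].append(node)
        return
    if node[0] in CLAUSE_LABELS:
        buffers.append([])
    for child in node[1:]:
        clause_split(child, buffers)

# Toy tree for "restaurant is that serves pasta" (relative clause example)
tree = ("S",
        ("NP", ("NN", "restaurant")),
        ("VP", ("VB", "is"),
               ("SBAR", ("WHNP", ("WDT", "that")),
                        ("S", ("VP", ("VBZ", "serves"),
                                     ("NP", ("NN", "pasta")))))))

buffers = [[]]
clause_split(tree, buffers)
# keep only the noun/verb leaves of each buffer
keywords = [[w for t, w in buf if t.startswith(("NN", "VB"))] for buf in buffers]
print(keywords)
```

Each buffer corresponds to one candidate sub-query, and the noun/verb filter yields exactly the objects later sent to the domain extraction.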
The evaluation criteria are defined as:
• 1 Star - Completely inadequate splitting considering the domains of the
entry
• 2 Stars - Lack of precision in the splitting or in the extraction of the keywords
• 3 Stars - Precise division of domains and keywords
As we did for the first-level split, a statistic on the star ratings is implemented to
evaluate the method, together with a statistic on the number of clauses split from each
sub-entry, which will be useful for the analysis of the results.
Figure 4.7: Example statistics for the clause split
Domains Extraction
After the evaluation of a number of entries we could observe that the splitting
done by the first-level split had lower ratings than the one done by the clause split,
so the latter was chosen to extract the domains.
The evaluation criteria are defined as:
• 1 Star - Completely inadequate domain extraction
• 2 Stars - Presence of inadequate and off-topic domains with a high score
• 3 Stars - Precise extraction of the domains
In this section the most useful statistic is the one on the number of domains extracted.
The main problem in this extraction is the vast number of domains extracted
from a single entry with a relatively small number of keywords. This is
due to the presence of multiple synsets corresponding to each single word in the
WordNet database.
Figure 4.8: Example statistics for domains extraction
Chapter 5
Implementation
5.1 The system general architecture
Figure 5.1: The Architecture schema
The framework that powers this research environment is centered on two top-level
tasks: the creation of the corpus and the use of this corpus within the context of
query analysis. The tools created to support and enhance these tasks are flexible
and can accommodate the diverging needs that are centered around the single
set of data. The center of it all is the web front-end, which powers the creation of
the corpus of queries and also functions as a visualization tool both for the outputs
of the algorithms employed in the underlying application to analyze the queries and
extract the domains, and for the statistical results section. The front-end
communicates with the outside by retrieving questions from the Yahoo!
Answers web service. This feature is on request, so a user browsing the Sift
web page can ask to retrieve questions from the outside; these questions will
be shown as new unrated entries and saved in the database.
The load of the process required to analyze the queries is non-negligible, both
in terms of CPU usage and memory, so it is not advisable to perform the
extraction and the analysis in real time in the same environment as
the database and the Web front-end. Therefore, a mechanism to offload the work
onto another computer has been devised. It is based on a standard and simple
architecture for background workers: the Web front-end, or the user through the
command line, can post work items on a queuing server, where they will be picked
up by the first available client. The client then processes the task, given the input
parameters to elaborate, and stores the results back in the database, from which
the web front-end will retrieve them later upon a user request.
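The work-offloading pattern can be sketched in-process as follows (the real system uses a Kestrel queue, a CouchDB database and separate machines; this single-process analogue only illustrates the producer/worker flow):

```python
import queue
import threading

# In-process stand-ins for the queuing server and the result store.
work_queue = queue.Queue()
results = {}

def analyze(entry_text):
    # placeholder for the parse/split/match pipeline
    return entry_text.split()

def worker():
    """Background worker: pull tasks until a sentinel arrives."""
    while True:
        entry_id, text = work_queue.get()
        if entry_id is None:               # sentinel: shut down
            break
        results[entry_id] = analyze(text)  # "store back in the database"
        work_queue.task_done()

t = threading.Thread(target=worker)
t.start()
# the front-end posts work items and returns to the user immediately
work_queue.put(("e1", "thai restaurant in LA"))
work_queue.put((None, None))
t.join()
print(results["e1"])
```

The front-end never blocks on the analysis: it only enqueues the entry and later reads whatever results the worker has stored.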
5.2 The Sift Application
Sift is the Web Application composed of the front-end and the tool used to extract
data from the Yahoo! Answers Web Service. It is based on the Sinatra framework
and has thus been written in the Ruby programming language. In addition to
the main Sinatra library for web application development, it imports the Ruby
libraries used to interface with the CouchDB server, the Kestrel queue service and
the Yahoo! Answers Web Service. The application is divided into three
principal parts: the Models, the Controllers and the Views.
Figure 5.2: The models
There are two models in Sift, as shown in Fig. 5.2. The first one refers to
the entries stored in the document database. These entries contain the
input retrieved from Yahoo! Answers, the rating given to the
entry itself, the results of the elaboration on the entry and their evaluations. The
creation or update of an entry makes the tool automatically send a message to the
queue server, where the task can later be picked up by the background worker
application that will analyze it. The second model corresponds to the result of a
query made to the Yahoo! Answers website, before it is inserted in the document
database, where it becomes an Entry. It contains the fields of the content provided
by the Yahoo! API; the ones of interest to us are the category identifier, the question
identifier, the question title and the body.
The controllers in Sift correspond to a series of functions that are called when an
incoming URL matches a route pattern. The most important method is
the one for the index page, where most of the work is done. Its basic task is to
prepare the list of entries to be shown to the user. In order to do this, it
takes a list of parameters fed by the user, which can include the number of items to
show, the page to show, the ordering, and a filtering by the rating given to an entry or
by the lack of it. The interface allows for the manual creation of an entry, for which
a method is thus available. Methods to modify and delete one or multiple
entries have also been implemented, so that the filtering and elaboration of the input
is easier. Other available methods let the user rate an entry, either
as a whole, in the case of the extraction of the corpus, or for a specific feature,
which is useful for the evaluation of the strategies.
Figure 5.3: Screen of the Sift application
The last part is the view module, where a template is processed, taking as input
the different variables prepared by the controller. On the index page, where the
list of entries is shown, the output HTML contains the list of all the entries, with
each entry containing the results of any processing done, although
this part is hidden at first. The index page shows a list of summaries that
can be navigated either with the mouse or the keyboard. Elements can
be given a rating by clicking the corresponding star to the right of them. It is also
possible to change the rating of more than one element at once by selecting them
first and then using the drop-down action menu or the keyboard shortcuts to give
them a new rating. To see the complete details of a single entry one has to click on
it, and the screen will show all the retrieved data. If the data has been previously
processed by the background worker system, it is possible to see
the alternative strategies and rate them individually; otherwise the phrase "This
entry hasn't been parsed yet" will appear on the screen. This interface was created
using the jQuery toolkit for JavaScript, which offers a high-level view of the web
page, allowing us to query and manipulate elements as well as make asynchronous
calls to the servers.
5.2.1 Bee - Distributed Background Processing
All the procedures and analysis functions introduced in the previous
chapters find their place in the framework described in this section, as modules of the
background worker.
In this framework, used to test query analysis and domain extraction strategies,
we find the greater part of the tool chain created
expressly for this research. We decided to create such a system in order to
allow asynchronous processing, as well as to offload the main
server. The system, nicknamed Bee, was written in Scala and uses libraries to convert
to and from JSON and to access CouchDB as well as the Kestrel queue service.
Refer to fig. 5.4 for an overview of the classes that compose this framework.
Figure 5.4: The Bee Structure
Tasks
The central concept in Bee is that of a task. A task is a single unit of work that
is coded by the user of the library. Tasks are centered around one function, run,
which takes as parameters the original input fed from the queue, as well as the
results of the previous tasks, and returns the result of its computation in JSON
format. Tasks are organized in chains, where each chain is given a unique
identifier. It is the ultimate result of the chain that is stored back
into the document database.
Method                  Description
setup(configuration)    Optionally implemented; provides a means for the task to
                        access its configuration and use it to set itself up, for
                        example by loading data dictionaries. This is done only
                        once, when the task is first loaded, so it is a good
                        opportunity to cache things.
identifier: string      Returns the unique identifier of the task within the chain.
run(inputParams): json  The core method of the task. It takes as input a Map
                        from the queue as well as the results of the previous tasks
                        in the chain, arranged according to their identifiers. The
                        output of the function should be a JSON value. Any
                        exception thrown by this task will interrupt the chain and
                        will be stored in the errors field in the database.
version: string         Allows a task to report its version, forcing the
                        re-computation of all the elements in a chain starting from
                        this task.
Table 5.1: The task interface
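As an illustrative sketch of this interface (in Python rather than the Scala used by Bee, with hypothetical names mirroring Table 5.1), a task could look like:

```python
import json


class Task:
    """Illustrative sketch of the Bee task interface (Table 5.1)."""

    def setup(self, configuration):
        # Optional: run once when the task is first loaded,
        # e.g. to cache data dictionaries.
        pass

    def identifier(self):
        # Unique identifier of the task within its chain.
        raise NotImplementedError

    def version(self):
        # Bumping this forces re-computation from this task onward.
        return "1"

    def run(self, input_params):
        # input_params carries the original queue input plus the
        # previous results, keyed by task identifier; returns JSON.
        raise NotImplementedError


class WordCountTask(Task):
    """Hypothetical example task: count the words of the input sentence."""

    def identifier(self):
        return "word-count"

    def run(self, input_params):
        sentence = input_params["input"]
        return json.dumps({"words": len(sentence.split())})
```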
Workers
The workers interact with the queue service and oversee the execution of the
tasks in their chain. At initialization they are given a unique identifier, the
name of the chain as defined in the configuration. They then set up every task
before opening a connection to the queue service, waiting for messages in the queue
named after their identifier. Once a message is received, they first check the
database to see whether this particular instance of the chain has already been executed.
If that is the case, the existing data is fetched and parsed, and tasks are skipped
as long as no errors have been encountered and no task version has changed.
Once all the remaining tasks have been executed, the worker asks the database
actor to store the updated document before resuming listening on the queue.
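The control flow just described can be sketched as follows; this is a simplified, single-threaded Python rendition, with hypothetical stand-ins for the database actor and the tasks:

```python
class InMemoryDb:
    """Hypothetical stand-in for the CouchDB document store."""
    def __init__(self):
        self.docs = {}
    def get(self, key):
        return self.docs.get(key)
    def store(self, key, doc):
        self.docs[key] = doc


def process_message(message_id, tasks, db, queue_input):
    # Fetch any existing chain state for this message.
    doc = db.get(message_id) or {"results": {}, "versions": {}, "errors": {}}
    for task in tasks:
        tid = task.identifier()
        # Skip tasks already executed with an unchanged version and no error.
        if (tid in doc["results"]
                and doc["versions"].get(tid) == task.version()
                and tid not in doc["errors"]):
            continue
        try:
            doc["results"][tid] = task.run({"input": queue_input, **doc["results"]})
            doc["versions"][tid] = task.version()
            doc["errors"].pop(tid, None)
        except Exception as exc:
            # An exception interrupts the chain and is stored in the errors field.
            doc["errors"][tid] = str(exc)
            break
    # Store the updated document before resuming listening on the queue.
    db.store(message_id, doc)
    return doc
```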
Splitting Tasks
Splitting tasks are a specialization of the generic task in which parsing and
serialization are already taken care of. The user only has to define a function, split,
that transforms the input from a tree-based representation into a series of parts.
Each part is an instance of a class composed of two fields: the first is the
phrase, or sub-sentence, that is considered as that part; the second is the
list of the objects retrieved from that phrase, where each object is a pair made
of the object itself and its part of speech (e.g. verb, noun or adjective).
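A minimal sketch of the part structure and a splitting task wrapping a user-supplied split function (Python here; the framework itself is Scala, and the names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Part:
    """A sub-sentence and the (word, part-of-speech) objects found in it."""
    phrase: str
    objects: List[Tuple[str, str]] = field(default_factory=list)


class SplittingTask:
    """Wraps a user-supplied split function; parsing and serialization
    would be handled by the framework around it."""

    def __init__(self, split: Callable[[object], List[Part]]):
        self.split = split

    def run(self, tree) -> List[Part]:
        # Transform the tree-based input into a series of parts.
        return self.split(tree)
```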
Domain Extraction and matching task
These tasks have been united in one single big task because of implementation and
resource requirements. They correspond to the third phase of the query analysis
process, where the different parts of a sentence are analyzed in order to obtain a
series of domains that are later mapped to different query services. This interface
once again has a single method to implement. This method, named extract,
takes the list of parts obtained from the previous operation and has to return, for
each part of the sentence, a list of possible domains and a list of matched Service
Marts. If nothing of importance is found, or if a word is not recognized, the output
list can be empty.
5.3 Procedures
5.3.1 Parsing
Once the framework was established and stabilized, the first tasks were imple-
mented. The first of these were the parsing strategies, which take as input the
natural language sentence and output its grammatical structure. This output has
the form of an arbitrarily deep tree, where a leaf represents an atom, a word
of the sentence, while an internal node represents a grouping of these words in
some structure, for example a noun phrase or a verb phrase.
Parsers evaluated Different parsers were evaluated to test their performance
with respect to our processing needs. The first parser to be evaluated was the Stanford
Natural Language Parser. Distributed as a Java library, it requires little code to use:
one simply loads the parser with the chosen training data file and then
applies it to a sentence to get the resulting tree. This tree is then transformed
from the native tree representation of the Stanford Parser into
a generic one that is used by the later tasks. Here is an example of how to
load and apply the parser, in Scala.
// dataFile points to the training data set on the hard drive
// input contains the natural language sentence
import edu.stanford.nlp.parsers._

val parser = new LexicalizedParser(dataFile)
val tree = parser.apply(input)
Another parser tested was the Shallow Parser, developed at the University of
Illinois at Urbana-Champaign. It takes a different approach to obtain the final
result: it works by using a series of different tools that process the input
into a progressively more complex form. The first of these steps is to sanitize the
input, making sure that every element is well tokenized; that is, every element in
the sentence is spaced out, even the punctuation. It also performs some slight
transformations and normalization operations. The output of this first operation
is then sent to a second program, which takes care of tagging each element of
the sentence with its most probable part of speech, be it noun, verb, adjective or
other. This is finally sent to a server called the chunking server, which takes
the annotated input and groups, or chunks, elements into what it considers the
basic structures of the sentence. The name shallow thus comes from the fact
that this grouping operation is done at one level only, which means that the output
can be formally defined as a sequence of elements, each of which is either an atom
or a sub-sequence of such atoms. This output, given as text by the server, is then
parsed by the task and put into a tree representation, although it is only
one level deep.
5.3.2 Sentence Splitting Strategies
Once we have a tree with a satisfying parsing structure (in our case we chose the
Stanford tree version), we proceed to the division of that structure into parts,
with the expectation that each part will correspond to a single semantic domain.
In output, each part of the sentence is represented by an instance of the "part"
class, which contains the extracted sub-sentence as well as the objects that are
considered important to the definition of the domain.
First-Level Split
A first strategy to split the sentences is to assume that the first level at which a
separation of the sentence occurs defines the various domains. The purpose of the
task is thus to find the first internal node that has more than one child, and to take
each child as a different part from which a domain will be extracted. From each
sub-tree we look for interesting elements, namely noun and verb atoms, and take them
as objects. While this first attempt at splitting the tree is simple and does not take
into account the subtleties of the resulting parse tree, it gives a good baseline and
provides a jumping-off point from which we can explore better techniques.
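A sketch of this first-level strategy, assuming a simplified tree representation (nested lists for internal nodes, (word, pos) tuples for leaves; Python is used here for illustration):

```python
def first_level_split(tree):
    # Descend until we find the first node with more than one child;
    # each of its children becomes a candidate part.
    node = tree
    while isinstance(node, list) and len(node) == 1:
        node = node[0]
    if not isinstance(node, list):
        return [node]          # degenerate case: a single atom
    return node


def collect_objects(subtree, keep=("noun", "verb")):
    # Gather the noun and verb atoms of a sub-tree as objects.
    if isinstance(subtree, tuple):
        return [subtree] if subtree[1] in keep else []
    out = []
    for child in subtree:
        out.extend(collect_objects(child, keep))
    return out
```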
Clauses Extraction
Given that sentences are most of the time organized in subject-verb-object
form, and that the object is often the element most likely to carry a
subjunctive or relative clause, we can expect the tree to lean to the right most of
the time, a fact the previous technique does not take into account. To fix that,
a second technique has been implemented, where the tree is visited
in a depth-first, left-to-right manner, buffering elements into a domain until a new
clause is encountered. Such a clause is encountered when we find an internal node
labeled as one of: plain sentences, relative clauses, subordinate clauses, or relative
clauses in the form of questions. From each buffered part we then take all the leaves,
filter out the ones that are neither nouns nor verbs, and emit the remainder as the
resulting objects.
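The visit can be sketched as follows, assuming internal nodes of the form (label, children), leaves of the form (word, pos), and a hypothetical set of clause labels:

```python
# Hypothetical subset of clause labels marking the start of a new part.
CLAUSE_LABELS = {"S", "SBAR", "SBARQ", "SQ"}


def clause_split(tree):
    parts, buffer = [], []

    def visit(node):
        label, payload = node
        if isinstance(payload, str):      # leaf: (word, pos)
            buffer.append(node)
            return
        if label in CLAUSE_LABELS and buffer:
            parts.append(list(buffer))    # a new clause starts a new part
            buffer.clear()
        for child in payload:
            visit(child)

    visit(tree)
    if buffer:
        parts.append(list(buffer))
    # Keep only the noun and verb leaves as the objects of each part.
    return [[leaf for leaf in part if leaf[1] in ("noun", "verb")]
            for part in parts]
```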
The Information Extraction and Matching algorithm In the following para-
graphs we go into the implementation details of the core activity of the thesis
project: the information extraction and matching algorithm.
The figure below shows the complete structure of the program, from the
extraction of the Part object (a sub-entry) to the matching of each of them to a
proper service mart.
Figure 5.5: The algorithm schema after the splitting
5.3.3 Information extraction
Part I
In this stage of the implementation we worked on the sub-entries to extract the
greatest amount of information out of them.
This was done through a 4-step process, the structure of which is illustrated in the
schema:
Figure 5.6: The Information Extraction Flow
Named Entity Extraction The named entity extraction tool is used to examine
each object extracted from the sub-entry. To do this we first initialize the classifier
with the training set; then, using as input the string value to examine, we extract
a list structured as List<Triple<String,Integer,Integer>>. This structure contains
the names of the entities extracted, either Location, Organization or Person, and
the offsets of the values they refer to in the examined string.
// initialize the classifier training set
var serializedClassifier = "classifiers/ner-eng-ie.crf-3-all2008-distsim.ser.gz"
// initialize the classifier
var classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
// extract the information
var g = classifier.testStringAndGetCharacterOffsets(sentence)
From the variable g we then get the types of the entities extracted and label the
examined values accordingly. These entity labels will then be used to extract the
domains from WordNet, a task that would otherwise be impossible for proper nouns.
Domain Extraction All the sub-entries output by the splitting methods are
stored in specific objects called "parts". These objects contain the original version
of the sentence in the sub-entry, all the noun and verb objects retrieved from WordNet,
as well as the nouns retrieved by the Named Entity Recognizer.
The schema below shows the structure of the retrieval of the domains from
each object saved in the part structure (O1, O2, O3). This retrieval is done by
exploring the WordNet Domains database; for each one of the objects there
may be more than one group of domains to retrieve: groups D1, D2 and D3 all
refer to O1, due to the subdivision into multiple synsets (fig. 4.4).
Figure 5.7: Domain extraction algorithm structure
Tf-Idf After we extracted the domains from the basic objects, in order to get
the most relevant domains we use the tf-idf information retrieval technique as a
sorting mechanism, calculating the importance of a single domain from its relative
presence in a single word over how common it is across all the domains we
retrieved from the objects of the part. This technique allows us to calculate how
much information a word carries within a set of documents. It has two principal
components. The Inverse Document Frequency, or idf, takes into account that a
word that is very common within the whole collection does not give much infor-
mation, as it lacks uniqueness; a word is therefore penalized if it appears
in too many documents. On the other hand, a word is considered as important to
a document as the number of times it appears in it, which the tf, or Term Frequency,
indicator measures. These two results are combined to give a final score to
each unique word within the part of the sentence, which we then use to order the
results, starting from the highest score.
In particular the formula used in our case was:
tf for domain i = ni / nTot
    ni   = the number of times domain i is encountered
    nTot = the total number of domains in the sub-query

idf for domain i = log( D / a(i) )
    D    = the total number of senses for the considered part
    a(i) = the total number of senses in which domain i appears
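Applying these formulas, the ranking step could be sketched as follows (a Python illustration; `senses` stands for the groups of domains retrieved per synset of a part):

```python
import math


def rank_domains(senses):
    # senses: list of sense groups, each a list of domain names
    # retrieved from WordNet for that sense.
    all_domains = [d for group in senses for d in group]
    n_tot = len(all_domains)   # total number of domains in the sub-query
    big_d = len(senses)        # total number of senses for the part
    scores = {}
    for domain in set(all_domains):
        ni = all_domains.count(domain)
        tf = ni / n_tot
        a_i = sum(1 for group in senses if domain in group)
        idf = math.log(big_d / a_i)
        scores[domain] = tf * idf
    # Order the results, starting from the highest score.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```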
Bonus Score Refinement After obtaining an ordered list of domains for each
part, we looked for a method to improve and refine the scores. We decided to
implement a feature that assigns a bonus score to each domain that has a connection
to the other domains of the part. This connection is based on the relation between
the domains in the WordNet domain hierarchy. We reproduced the model of the
hierarchy in our program, and with a simple method we can determine whether two
domains are related, as father-son or siblings, or unrelated.
Figure 5.8: A graphical sample of a substructure of the WordNet hierarchy
We decided to reward with a bonus score the nearby domains (siblings) in each
group retrieved. This was done under the assumption that domains that are related
in the hierarchy are more likely to be related in a sentence as well. Consider, for
example, the sub-query shown in Table 5.2: the domains retrieved from it appear on
the left and the optimized version on the right. The domains whose score has been
increased are: geography, tourism, transport, politics and geology. The sibling
groupings are, respectively, geography/geology (as seen in fig. 5.8) and
tourism/politics/transport.
"to travel to North Korea from America?"

Before refinement        After refinement
3.2331  geography        3.8797  geography
1.5222  tourism          1.8267  tourism
0.7611  history          0.7611  history
0.5074  transport        0.6089  transport
0.3805  politics         0.4566  politics
0.3805  geology          0.4566  geology

Table 5.2: Bonus Refinement example
We can see that the more relevant domains get an increase in value, though the
ordering of the domains does not change in this case.
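The refinement can be sketched as follows; the hierarchy excerpt and the 20% relative bonus are assumptions made for illustration (the factor is consistent with the values in Table 5.2):

```python
# Hypothetical child -> parent excerpt of the WordNet domain hierarchy.
PARENT = {
    "geography": "pure_science",
    "geology": "pure_science",
    "tourism": "social_science",
    "transport": "social_science",
    "politics": "social_science",
    "history": "humanities",
}


def related(a, b):
    # Father-son or siblings in the hierarchy.
    return (PARENT.get(a) == b or PARENT.get(b) == a
            or (PARENT.get(a) is not None and PARENT.get(a) == PARENT.get(b)))


def apply_bonus(scores, bonus=0.2):
    # Increase by a relative `bonus` the score of every domain that is
    # related to at least one other domain retrieved for the same part.
    out = {}
    for d, s in scores.items():
        if any(related(d, other) for other in scores if other != d):
            out[d] = s * (1 + bonus)
        else:
            out[d] = s
    return out
```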
Part II
Extraction of data types In addition to the entity extraction previously done with
the NER, we also implemented extraction of various data types using regular expres-
sions.
We decided to use a single expression per data type, capturing the
main characteristic of the type in question. This choice was made so that the only
change needed to update the extraction algorithm for a data type is the
modification of its regular expression.
Here is an overview of the regular expressions used.
Titles: All the expressions included in double quotes

"(\"[^\"]+\")+"

Dates: All the dates in the American format mm/dd/yyyy

"(0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])[-/.](20[0-9][0-9]|19[0-9][0-9])"
Prices: All the numbers followed by the name of a currency (USD, Pound, Euro...)

"[\s]([0-9]+(.[0-9]{2})?)([\s]?)(USD|usd|dollars|dollar|Pounds|pounds|Euro|euros|Euros)"

Numbers: Any real number, with optional decimal point and digits after the
decimal, and optional positive (+) or negative (-) designation.

"[\s][-+]?\d+(\.\d+)?[\s]"

Time: Times separated by either : or . It will match a 24-hour time, or a 12-hour
time with AM or PM specified. Allows 0-59 minutes, and 0-59 seconds. Seconds
are not required.

"[\s](((([0]?[1-9]|1[0-2])):([0-5][0-9])(:[0-5][0-9])?([\s])?(AM|am|aM|Am|PM|pm|pM|Pm))|(([0]?[0-9]|1[0-9]|2[0-3]):([0-5][0-9])(:[0-5][0-9])?))"
All the data retrieved are stored in a DataType object which is unique for every
sub-entry and contains all the information about it.
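As an illustration, the extraction step can be sketched in Python with slightly simplified versions of some of the patterns above (the simplifications and the function name are ours):

```python
import re

# Simplified Python renditions of the expressions listed above.
PATTERNS = {
    "title": r'("[^"]+")',
    "date": r'(0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])[-/.]'
            r'(20[0-9][0-9]|19[0-9][0-9])',
    "price": r'([0-9]+(\.[0-9]{2})?)\s?'
             r'(USD|usd|dollars|dollar|Pounds|pounds|Euro|euros|Euros)',
}


def extract_types(sentence):
    # Return, for every data type, the list of raw matches found.
    found = {}
    for name, pattern in PATTERNS.items():
        matches = [m.group(0) for m in re.finditer(pattern, sentence)]
        if matches:
            found[name] = matches
    return found
```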
5.3.4 Service Mart Semi-Automatic Generation
The service mart generation we implemented is structured in different phases:
Creation of “artificial” services:
For each service we have to define a new object with all the characteristics a real
one would have. This is the typical structure of a service attribute:
Group                       Service Attributes
Time_period_1               day, month, year, date
Time_period_2               hour, minute, time
ListGeography               location, zipcode, city, streetname, streetnumber, country
ListJourney                 (city, city), (location, location), (zipcode, zipcode)
Rating                      rating value
Distance                    distance value
Characterizing Parameters   mixed parameters

Table 5.3: List of Groups for the Service Mart Generation
class ServiceAttribute:
    var Id: String                          = the Id of the service
    var DataTemp: dataTemplate.Value        = the type of data expected
    var Semantics: SemanticAnnotation       = the semantic annotation (synsets + domains)
    var Typ: Type.Value                     = the expected format for the data, i.e. INT, STRING...
    var mandatory: Boolean                  = a quality of the attribute
    var AttributeDirection: Direction.Value = the direction of the attribute (IN, OUT, IN_OUT)

And this is an example of an artificial service attribute indicating the "day" concept:

new ServiceAttribute(R.nextInt.toString, dataTemplate.day,
    new SemanticAnnotation(List("14297391"), List("time_period")),
    Type.STRING, true, Direction.IN)
We then categorized all the fictitious services in different groups based on their
semantic values (see Table 5.3).
These groups are then randomly used to build different access patterns and
assemble the service marts. A special consideration has to be made for the "Charac-
terizing Parameters". These are the parameters that characterize, as their name
says, the meaning of the access pattern. They are not generic parameters, such as
space and time, but proper names of persons, titles of books or albums, or any other
specific quality a user may want to look for, such as the genre of a restaurant or the name
of a team. These parameters are thus the core of the generation of the fictitious
service marts.
The final corpus of service marts is completely semantically annotated, as can be
seen in the sample service mart extracted from the corpus in the schema
below.
Figure 5.9: A sample of a generated service mart data structure
5.3.5 Map sub-queries to access patterns
The last step in our analysis is to match sub-queries to appropriate service marts
that can satisfy their search requests. To do this we implemented a complex com-
parison algorithm. As said earlier, we extract from each part a DataType object
which contains different data types, identified either by the regular expressions or
by the entity recognizer.
The matching happens specifically between the available service attributes con-
tained in the access patterns and the data from the sub-queries. If the matching
is satisfied, the service mart from which the access pattern is selected becomes a
suitable candidate for the final search. All the candidate service marts are sorted
according to the domain matching score they earned during the mapping of the
sub-queries (see section 4.8).
The mapping For each type we implemented a dedicated mapping function. Each
type is treated separately, and the mapping schema in the figure below is applied
to each DataType/AccessPattern combination. The output of the function is a list
of results containing all the data that will be needed for the subsequent invocation
of the search service: the service id, the data, and the requested type format. The
updated DataType, with all the elements still available for matching, is also given
as output so that it can be the input to the following calls to the mapping function.
Figure 5.10: Mapping schema
Since not every type is perfectly recognizable with the tools we used, we decided
to implement an enhanced comparison. We compare the attributes not only with
their correspondents of the same type but also, as a backup matching,
with more generic types that can be a suitable match. For instance, if the service
attribute needs a "Price" type and we can't find one in the DataType structure
extracted from the sub-query, we look for a simple "number" type.
The following is a detailed summary of every data type comparison and its
backups:
• Number and Word are compared only with their corresponding types.
• Price: compared with the Price type. If not matched, it is compared
with the Number type.
• Date: the Date structure is formed by day/month/year. The comparison
happens between Date types. If Day or Month types are requested singly
by the services, the match is searched for inside a Date structure.
• Time: the Time structure is formed by hour/minute. The comparison hap-
pens between Time types. If Hour types are requested singly by the
services, the match is searched for inside a Time structure.
• Organization, Location, Person, Title: these types are compared with their
respective types in the DataType structure. If no match is found,
they are compared with the Word type using a semantic matching.
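The backup comparison can be sketched as follows (Python for illustration; the fallback table and function names are hypothetical stand-ins for the rules above):

```python
# Hypothetical fallback table: each specific type and the more generic
# type it is compared against when no exact match is found.
FALLBACK = {
    "price": "number",
    "organization": "word",
    "location": "word",
    "person": "word",
    "title": "word",
}


def match_attribute(attr_type, datatype):
    """Pick a value for a service attribute from the extracted DataType
    dict (type name -> list of values), consuming the matched value so
    that later calls cannot reuse it."""
    for t in (attr_type, FALLBACK.get(attr_type)):
        if t and datatype.get(t):
            return datatype[t].pop(0)
    return None
```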
Chapter 6
Evaluation
In this chapter we’ll evaluate the results extracted from the entries examined in
the Sift application.
6.1 Creation of the corpus of queries and service
marts
We retrieved a great number of entries from Yahoo! Answers to populate our
database and give us a large corpus on which to test our analyses. Over a period of
time we collected close to 1200 entries, though 10% of them had to be discarded
because they were completely unsuitable for our purposes. The corpus thus gives a
good idea of the elements found within Yahoo's database.
The service mart corpus we created consists of approximately 70 artificial marts
assembled with the algorithm described in section 5.3.4. The annotations of these
service marts cover a great number of domains from the WordNet Domain Hierarchy,
so that the matching with words extracted from the sub-entries can be more complete.
CHAPTER 6. EVALUATION 73
The total number of access patterns is more than 400, and we assign between 5 and
8 of them to every service mart. Each access pattern is characterized by a maximum
of 7 completely annotated service attributes.
6.2 The Experiment
After the creation of the environment and the definition of all the algorithms and
analyses to test the queries, we rated every aspect manually according to the criteria
already defined in chapter 4. We evaluated the entries and the data extracted from
them with a 3-star rating system in three different categories: first we rated the
general quality of the entry, then the splitting applications and finally the domain
extraction. We also collected some data on the number of sub-entries extracted, the
number of domains retrieved and the service marts matched.
These ratings are automatically organized into statistics by the program so that we
can more easily analyze the results obtained.
6.3 The results and the Evaluation
6.3.1 Entries evaluation
Figure 6.1: The Main Screen of the Sift Application
Most of the entries from Yahoo! Answers we rated had a very low score and thus
did not correspond to our needs. A lot of questions had to be pruned
because they contained special characters, misspelled words or other nonsense
punctuation (e.g. "Want 2 find an hotel in tokio—-ASAP!!!!!!!PLS!"). The num-
ber of real multi-domain queries we extracted from our source is very low, and
most of those entries are not suitable for our analysis because they don't contain
all the data required to successfully invoke a service mart. The main reason is that
a lot of parameters are implicit in a question; take for example this sub-entry:
"I want to find an hotel in Fiji for me and my son for the next week"
A human reading that question can extract some useful information, such as:
• How many persons are interested in the hotel? parent + son = 2
• What is the date of the vacation? next week = date of this week + 7 days
This data is available in the question, but due to the limited power of our extraction
methods we cannot identify it. That is why many real multi-domain questions
will not find a matching service mart through our algorithm.
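The “next week” case above could in principle be resolved with a small amount of date arithmetic. The sketch below is only an illustration of the idea (the function name and the handful of recognized expressions are invented, not part of our tool):

```python
from datetime import date, timedelta
from typing import Optional

def resolve_relative_date(expression: str, today: date) -> Optional[date]:
    """Resolve a few relative date expressions to an absolute date."""
    expression = expression.lower().strip()
    if expression == "next week":
        return today + timedelta(days=7)
    if expression == "tomorrow":
        return today + timedelta(days=1)
    return None  # expression not recognized

print(resolve_relative_date("next week", date(2010, 3, 22)))  # 2010-03-29
```

A production extractor would need many more patterns (weekday names, “in N days”, holidays), which is exactly the kind of implicit information our current methods cannot recover.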
Moreover, we have to consider the nature of Yahoo! Answers itself. The service
was originally created to let people ask questions to other people, not specifically
multi-domain ones. In fact most of the questions we found asked for advice or
opinions about something, the sort of things one would ask only another human
being and never an automatic online service; for example “How’s the hotel X? I
want to go there with my kids of 3 and 5 years, will it be a good choice?” or “I
heard bad things about that neighborhood, is it really bad? I’m moving there next
week”. Both of these questions are completely unanswerable by any automatic
service; it is even possible that the user who asked them wants to collect multiple
opinions on the matter and then decide for himself, a completely different approach
from what we expect from a user of our multi-domain search service.
Another downside of using Yahoo! Answers is that the form of the site lets the
user give a “title” and then a longer, more specific explanation of the request. We
decided to retrieve only the title section because of the useless elements, such as
links and attachments, that can be found in the “text” section. This is one of the
reasons for the great number of low-rated entries: often the “title”, even when
comprehensible and well structured, refers only to the main topic of a request
whose full text may be multi-domain.
We retrieved approximately 1200 entries, but we had to discard some of them
for the reasons stated above, finally obtaining 1064 entries. 759 of them were
rated as one star, which means that they were completely inappropriate for the
purposes of this research: they were either formally wrong or involved only a single
domain. We then retrieved 221 two-star rated questions, which can be considered
multi-domain but are not compatible with existing web services and are thus of
low value. The three-star rated entries were only 84, still a sufficient number to
test our algorithm and analysis.
We then evaluated the entries in every aspect according to the criteria presented in
Chapter 4. The evaluation of the splitting and of the domain retrieval was done
only for the entries rated with two or three stars, because it would have been
meaningless to evaluate the splitting of a single-domain question.
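The rating distribution just reported can be summarized in a few lines of code; the counts are the ones stated above:

```python
# Rating distribution of the 1064 retained Yahoo! Answers entries.
counts = {"1 star": 759, "2 stars": 221, "3 stars": 84}

total = sum(counts.values())
assert total == 1064  # sanity check against the reported total

for rating, n in counts.items():
    print(f"{rating}: {n} entries ({100 * n / total:.1f}%)")
```

Roughly seven entries out of ten were thus unusable for multi-domain evaluation, which quantifies how poorly suited a general Q&A site is as a source of multi-domain queries.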
6.3.2 Splitting Evaluation
Figure 6.2: First Level Split Statistics
Figure 6.3: Clause Level Split Statistics
From the data we extracted we can see that the “clause level split” ratings are
better than the “first level split” ones. This can be explained by a simple reflection
on the tree structure of a typical English sentence. The most common structure of
a standard question or statement is subject-verb-object, and a complement is very
likely to be attached to the last element of the principal clause, that is, the object.
The same holds for complex sentences: the principal clause sits at the beginning of
the sentence and is followed by coordinate and subordinate clauses in the right part
of the tree. This layout often unbalances the tree and makes it lean to the right,
which can severely affect the first level split: it tends to separate just the first
clause from the rest of the sentence, as seen in the example below.
Figure 6.4: Wrong First level Splitting
As for the number of keywords retrieved from the sub-entries, we discovered that
approximately 90% of them have fewer than 10 keywords from which we can
extract domains in the subsequent tasks. The very low number of extracted
keywords in some entries can limit the information extraction, but this is a limit
of the approach we decided to use. In fact, since domains cannot be extracted
from adverbs and adjectives (they are not in the WordNet Domains database), we
had to discard them and keep only nouns and verbs.
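The noun-and-verb filtering step can be sketched as follows; the hand-made POS tags below are illustrative, not the output of our actual tagger:

```python
# Keep only nouns and verbs as keyword candidates, since WordNet Domains
# provides no domain labels for adverbs and adjectives.
tagged = [("want", "VERB"), ("quickly", "ADV"), ("nice", "ADJ"),
          ("hotel", "NOUN"), ("find", "VERB"), ("Tokyo", "NOUN")]

keywords = [word for word, pos in tagged if pos in ("NOUN", "VERB")]
print(keywords)  # ['want', 'hotel', 'find', 'Tokyo']
```

Adjectives such as “nice” are dropped even though they may carry useful information (e.g. quality requirements), which is one of the losses discussed above.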
6.3.3 Domain Extraction Evaluation
Figure 6.5: WordNet Domain Statistics
The domain extraction method that uses the WordNet Domains database is effective
in retrieving all the possible domains a word can relate to, although the results
usually include far too many generic entries, obtained from the secondary senses
of the words. The great number of domains retrieved for each sub-entry (as we
can see in the statistics screenshot) compels us to sort them efficiently, so as to
find the most relevant ones and subsequently a proper matching. The main problem
in this section is therefore the sorting of the domains. We used the tf-idf method
as the starting point of our research and obtained the results listed in Fig. 6.5.
We then applied the bonus method described in Section 4.5.1 to improve the
domain scores, and the results show that the ratings of the domain sorting were
slightly better than in the previous run. The following table presents a summary
of the statistics:
Figure 6.6: WordNet Domain Statistics Optimized
Rating    Tf-Idf    Optimized
1 Star    142       110
2 Stars   133       134
3 Stars   23        54

Table 6.1: Domain Extraction Results summary
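The tf-idf scoring used as our starting point can be sketched as follows. This is the textbook formulation, not a transcription of our implementation, and the sample counts are invented:

```python
import math

def tf_idf(term_count: int, total_terms: int,
           num_docs: int, docs_with_term: int) -> float:
    """Classic tf-idf: term frequency times inverse document frequency."""
    tf = term_count / total_terms
    idf = math.log(num_docs / docs_with_term)
    return tf * idf

# A domain appearing twice among 8 candidate domains of a sub-entry,
# and present in 5 out of 1064 entries overall (invented figures).
score = tf_idf(term_count=2, total_terms=8, num_docs=1064, docs_with_term=5)
print(round(score, 3))  # 1.34
```

Rare domains thus score high when they recur inside a sub-entry, while ubiquitous generic domains (e.g. “factotum”) are pushed down, which is the behavior we need when pruning secondary senses.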
The main problem found in this section is the small and unbalanced corpus of
domains available in the database. The WordNet Domains Hierarchy contains
fewer than 200 domains (the complete structure can be found in Appendix III), a
meager number compared to the varied annotation needs of entries and services.
Moreover, the coverage of domains is very unbalanced: for example, the “tourism”
domain is categorized under the label “social sciences” and is not detailed in any
way, whereas the “sport” domain is divided into 29 subcategories covering every
possible sport discipline. This can really affect the annotation of an entry or a
service; for instance, a touristic service can count on only one domain in its
annotation, which greatly reduces its semantic potential in the matching process.
6.4 Service Mart Matching Evaluation
Figure 6.7: Service Mart Matching Statistics
From the auto-generated statistics we can see that 726 sub-entries have at least
one service mart that matches their data. This result can be considered successful
for the technical aspects of our matching algorithm: a good number of queries is
successfully matched to a service mart that is therefore invokable with every
required service parameter. Despite the correctness of the algorithm, we cannot
say anything about the semantic correctness and effectiveness of the matching,
since we do not have actual results but only the name and description of a
fictitious service mart.
6.5 A Complete Example of Information Extraction, Splitting and Matching
To properly examine every detail of the analysis and processing of an entry, we
chose a multi-domain question that gave good results in almost every section.
The original question in input:
Do you know a five star rated Thai or Japanese restaurant in LA on 03/23/2010
night? Then a near movie theater with an American five star rated comedy
movie? after I want 3 nights in a five star Malibu hotel from 03/22/2010 to
03/24/2010?
This is the splitting in clauses for the entry.
Figure 6.8: The Clause split of the sample entry
Below you can see the trees of the entry. Every sub-entry is identified by a
different tree; this is due to the punctuation and structure of the entry, which is
composed of three questions.
Figure 6.9: The trees of the clause split division
These are the data types extracted from each sub-entry:

Data Type    Sub-Entry I                Sub-Entry II                Sub-Entry III
Dates        23/03/2010                 --                          22/03/2010, 24/03/2010
Words        star, Thai, Japanese,      movie, theater, American,   nights, star, hotel
             restaurant, LA, night      star, comedy, movie
Locations    Japanese, LA               --                          Malibu
Numbers      --                         --                          3

Table 6.2: The Data Types extracted

These are the matching results for each sub-entry:
Figure 6.10: Matching for the sub-entries
In Fig. 6.10 you can see the results of the matching algorithm. For each sub-entry
we selected the service marts with the highest matching score in terms of domain
affinity. Then we checked for the availability of their required parameters in the
sub-entries and chose the service mart with the highest score that is also
invokable. In the results we thus indicated the sub-entry (in bold black), the
service mart Id and its domain matching score, the matched AP Id (at the bottom),
and all the service attribute Ids with the corresponding data from the sub-entry.
We manually checked the type requirements of every service mart against the
types extracted from the sub-entries, and we found that in the considered case
they corresponded completely, meaning the algorithm found a perfect match and
did not have to resort to the backup matching described in Section 5.3.5.
In the following screen we can see the Ids and scores of the service marts invokable
by the sub-entries. On the left, under each sub-entry, are the service mart Ids;
on the right, in green, the domain matching score for each of them.
Figure 6.11: Service Mart matching for the sub-entries
Chapter 7
Conclusions
7.1 Objectives and Final Evaluation
The objective of this thesis project was the research and creation of a matching
service that could help pair natural language queries with the most suitable
search services, a long and tedious task if done by hand by a single user. In the
research process we first had to enhance and enrich the analysis environment and
implement some automatic statistics tools. Using them, we validated some
techniques previously used to split entries and retrieve domains, and we researched,
tested and evaluated some new approaches to domain extraction and sorting.
Then we extracted, through the use of different tools, the data information from
the sub-queries. Combining the data and domain information, we then developed
a new task for the matching with services. Finally we presented the results obtained
and validated the approaches used. The final application presents the complete
process of acquisition, analysis and matching of the entries.
The results obtained from the program indicate that the approaches used were
quite successful in their technical aspects, as explained in the evaluation section,
and form a strong base for future testing and development of the tasks researched.
7.2 Future Works
Natural language elaboration and analysis is a very tricky and difficult process.
Even though we successfully tried some algorithms and analyses in this field,
we can be sure there is still a lot of work to be done to make this analysis tool
completely effective. The analysis of the input and its matching to suitable
search services will be the core activity of the SeCo project application, and with
this thesis we structured an initial, complete approach to the problem that can be
developed and extended.
Splitting This option has been explored through the Stanford Parser approach;
we only considered the structure of the sentence and never the meaning of each
clause. More in-depth research can be done in this field, exploiting the typical
structures and patterns of questions to split the entries more effectively.
Domain Extraction In this section the main downside of the tool we used is the
scarcity and little variety of domains. One possible solution is to enhance the
available domain database in some way, because the reannotation of the entire
WordNet database would be too ambitious a task, if not impossible.
Information Extraction This section can be enhanced with research on richer
regular expressions, or entity extraction tools can be tried on top of the existing
program structure. In particular, more advanced entity extraction tools can be
useful for identifying different kinds of subjects in the sub-entry, which can help
in the domain retrieval section.
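As a small illustration of what “richer regular expressions” could mean, here are two simple patterns for dates and bare counts (invented, and far simpler than what a production extractor would need):

```python
import re

text = "I want 3 nights in a five star Malibu hotel from 03/22/2010 to 03/24/2010"

# dd/mm/yyyy or mm/dd/yyyy style dates; no validation of day/month ranges.
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)

# Bare integers that are not part of a date (no adjacent digit or slash).
numbers = re.findall(r"(?<![\d/])\b\d+\b(?![\d/])", text)

print(dates)    # ['03/22/2010', '03/24/2010']
print(numbers)  # ['3']
```

The lookarounds in the second pattern keep the date components from being re-extracted as plain numbers; richer patterns along these lines could also cover spelled-out numbers such as “five”, which the sketch above misses.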
Service Mart Matching The algorithm proposed in this thesis project is complete,
although basic. Other approaches to the problem can be tried, for instance data
mining techniques, if we obtain a good number of annotated services and queries.
The main problem at the moment is the absence of real, testable services that
could give us real results on which to base our evaluation. Also, the number of
multi-domain questions retrieved is too small for thorough and complete testing,
so in the future we will need a larger corpus to validate the techniques used.
Chapter 8
Appendix
Appendix I - Glossary
Entry/Query - The question entered in the system manually or retrieved from
Yahoo! Answers.
Sub-Entry/Sub-Query - The various sections of the sentence obtained after the
splitting process.
Domain - Conceptual entity that defines the meaning of a word. The only domains
used are the ones belonging to the WordNet Domains Hierarchy.
Multi-Domain - Referred to a query that involves more than one domain in its
subject.
Appendix II - ServiceMart XML
<ServiceMart>
  <!-- Mart's global properties -->
  <Id>-976403944</Id>
  <Name/>
  <Description/>
  <SemanticAnnotation>
    <!-- List of synsets and domains -->
    <synsets>
      <li>22625</li>
      <li>14297391</li>
      <li>14348156</li>
      <li>5943480</li>
      <li>4167561</li>
      <li>4836174</li>
      <li>14301432</li>
      <li>5950505</li>
      <li>4247355</li>
      <li>6205452</li>
      <li>14343019</li>
      <li>14366717</li>
      <li>14373571</li>
      <li>4807180</li>
      <li>5403518</li>
      <li>8005407</li>
      <li>8023668</li>
    </synsets>
    <domains>
      <li>art</li>
      <li>cinema</li>
      <li>geography</li>
      <li>quality</li>
      <li>theatre</li>
      <li>time_period</li>
      <li>tv</li>
    </domains>
  </SemanticAnnotation>
  <!-- The first AccessPattern -->
  <SMPattern>
    <AccessPattern>
      <Id>-1390431360</Id>
      <Name/>
      <Description/>
      <!-- List of Attributes of the AP -->
      <ServiceAttribute>
        <Id>-1873967594</Id>
        <DataTemp>13</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>5950505</li>
            <li>6205452</li>
          </synsets>
          <domains>
            <li>art</li>
            <li>tv</li>
            <li>cinema</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>626178806</Id>
        <DataTemp>13</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>4247355</li>
            <li>6205452</li>
          </synsets>
          <domains>
            <li>art</li>
            <li>theatre</li>
            <li>cinema</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-214530752</Id>
        <DataTemp>3</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14343019</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-491889153</Id>
        <DataTemp>8</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14366717</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>1405824465</Id>
        <DataTemp>9</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14373571</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-87561533</Id>
        <DataTemp>5</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>4807180</li>
          </synsets>
          <domains>
            <li>geography</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-1925625980</Id>
        <DataTemp>6</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>8023668</li>
          </synsets>
          <domains>
            <li>geography</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-45290849</Id>
        <DataTemp>6</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>22625</li>
          </synsets>
          <domains>
            <li>geography</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
    </AccessPattern>
  </SMPattern>
  <AP>
    <AccessPattern>
      <Id>-1390431360</Id>
      <Name/>
      <Description/>
      <ServiceAttribute>
        <Id>-1873967594</Id>
        <DataTemp>13</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>5950505</li>
            <li>6205452</li>
          </synsets>
          <domains>
            <li>art</li>
            <li>tv</li>
            <li>cinema</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>626178806</Id>
        <DataTemp>13</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>4247355</li>
            <li>6205452</li>
          </synsets>
          <domains>
            <li>art</li>
            <li>theatre</li>
            <li>cinema</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-214530752</Id>
        <DataTemp>3</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14343019</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
      <ServiceAttribute>
        <Id>-491889153</Id>
        <DataTemp>8</DataTemp>
        <SemanticAnnotation>
          <synsets>
            <li>14366717</li>
          </synsets>
          <domains>
            <li>time_period</li>
          </domains>
        </SemanticAnnotation>
        <Typ>3</Typ>
        <mandatory>true</mandatory>
        <AttributeDirection>0</AttributeDirection>
      </ServiceAttribute>
    </AccessPattern>
  </AP>
</ServiceMart>
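A service mart descriptor of this shape can be inspected with a few lines of standard-library code. The sketch below inlines a trimmed toy descriptor so it is self-contained; the element names match the appendix, but the snippet is an illustration, not part of our application:

```python
import xml.etree.ElementTree as ET

# A trimmed descriptor with the same shape as the ServiceMart XML above.
xml_doc = """
<ServiceMart>
  <Id>-976403944</Id>
  <SemanticAnnotation>
    <domains>
      <li>art</li>
      <li>cinema</li>
      <li>geography</li>
    </domains>
  </SemanticAnnotation>
</ServiceMart>
"""

root = ET.fromstring(xml_doc)
# Read the mart's global domain annotations.
domains = [li.text for li in root.find("SemanticAnnotation/domains")]
print(domains)  # ['art', 'cinema', 'geography']
```

The matcher essentially compares such domain lists (and the per-attribute ones nested inside each AccessPattern) against the domains extracted from a sub-entry.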
Appendix III - WordNet Hierarchy
TOP LEVEL
doctrines
free_time
applied_science
pure_science
social_science
factotum
number
color
time_period
person
quality
metrology
HIERARCHY: DOCTRINES
archaeology
astrology
history
• heraldry
linguistics
• grammar
literature
• philology
philosophy
psychology
• psychoanalysis
art
• dance
• drawing
– painting
– philately
• music
• photography
• plastic_arts
– jewellery
– numismatics
– sculpture
• theatre
religion
• mythology
• occultism
• theology
HIERARCHY: FREE_TIME
free_time
play
• betting
• card
• chess
sport
• badminton
• baseball
• basketball
• cricket
• football
• golf
• rugby
• soccer
• table_tennis
• tennis
• volleyball
• cycling
• skating
• skiing
• hockey
• mountaineering
• rowing
• swimming
• sub
• diving
• athletics
• wrestling
• boxing
• fencing
• archery
• fishing
• hunting
• bowling
• racing
HIERARCHY: APPLIED_SCIENCE
applied_science
agriculture
alimentation
• gastronomy
architecture
• town_planning
• building_industry
• furniture
computer_science
engineering
• mechanics
– astronautics
– electrotechnics
– hydraulics
medicine
• dentistry
• pharmacy
• psychiatry
• radiology
• surgery
veterinary
• zootechnics
HIERARCHY: PURE_SCIENCE
astronomy
• topography
biology
• biochemistry
• ecology
• botany
zoology
• entomology
anatomy
physiology
genetics
chemistry
earth
• geology
• meteorology
• oceanography
• paleontology
• geography
mathematics
• geometry
physics
• acoustics
• atomic_physic
• electricity
• optics
HIERARCHY: SOCIAL_SCIENCE
administration
anthropology
• ethnology
– folklore
artisanship
body_care
commerce
economy
• banking
• book_keeping
• enterprise
• exchange
• insurance
• money
• tax
fashion
industry
law
• state
military
pedagogy
• school
• university
politics
• diplomacy
publishing
sexuality
sociology
telecommunication
• cinema
• post
• radio
• telegraphy
• telephony
• tv
tourism
transport
• aeronautic
• auto
• merchant_navy
• railway
Bibliography
[1] Luisa Bentivogli, Pamela Forner, Bernardo Magnini and Emanuele Pianta.
"Revising WordNet Domains Hierarchy: Semantics, Coverage, and Balanc-
ing". In Proceedings of COLING 2004 Workshop on "Multilingual Linguis-
tic Resources", Geneva, Switzerland, August 28, 2004, pp. 101-108.
[2] Jenny Rose Finkel, Trond Grenager and Christopher Manning. Incorporating
non-local information into information extraction systems by Gibbs sampling.
In Proceedings of the 43rd Annual Meeting of the Association for Computational
Linguistics, pp. 363-370, June 25-30, 2005, Ann Arbor, Michigan.
[3] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003
shared task: language-independent named entity recognition. In Proceedings of
the Seventh Conference on Natural Language Learning at HLT-NAACL 2003,
pp. 142-147, May 31, 2003, Edmonton, Canada.
[4] R. Mandala, T. Takenobu, and T. Hozumi. The use of WordNet in information
retrieval. In COLING/ACL Workshop on Usage of WordNet in Natural
Language Processing, Systert, 1998.
[5] Magnini, B. and C. Strapparava. 2000. Experiments in word domain disam-
biguation for parallel texts. In ACL-2000 Workshop on Word Sense and Mul-
tilinguality. Association for Computational Linguistics, New Brunswick, NJ.
[6] Alfio Gliozzo, Bernardo Magnini and Carlo Strapparava. Unsupervised Domain
Relevance Estimation for Word Sense Disambiguation. ITC-irst, Istituto per la
Ricerca Scientifica e Tecnologica, I-38050 Trento, Italy.
[7] B. Magnini, C. Strapparava, G. Pezzulo, A. Gliozzo. "The Role of Domain
Information in Word Sense Disambiguation", Natural Language Engineer-
ing, special issue on Word Sense Disambiguation, 8(4), pp. 359-373, Cam-
bridge University Press, 2002
[8] E. Minkov, R. C. Wang, and W. W. Cohen. Extracting personal names
from emails: Applying named entity recognition to informal text. In HLT-
EMNLP, 2005.
[9] Li, Y.; Moffat, A.; Stokes, N. & Cavedon, L. Exploring Probabilistic To-
ponym Resolution for Geographical Information Retrieval. In 3rd Workshop
on Geographic Information Retrieval (GIR 2006). Seattle, WA,USA, 2006.
17–22.
[10] Karen Spärck Jones. A statistical interpretation of term specificity and its
application in retrieval. Journal of Documentation, 28:11–21, 1972.
[11] Fox, E.A., Neves, F.D., Yu, X., Shen, R., Kim, S. and Fan, W. Exploring
the computing literature with visualization and stepping stones & pathways.
CACM 49(4): 52-58, 2006.
[12] C. Zhang, N. Sun, X. Hu, T. Huang, and T.-S. Chua. Query segmentation
based on eigenspace similarity. In Proceedings of the ACL-IJCNLP 2009
Conference, pages 185–188, Suntec, Singapore, August 2009.
[13] Bin Tan and Fuchun Peng. Unsupervised query segmentation using generative
language models and Wikipedia. In Proceedings of the 17th International
Conference on World Wide Web, April 21-25, 2008, Beijing, China.