
Lebanese American University School of Engineering

Electrical & Computer Engineering Department

COE594 Undergraduate Research Project

Final Report

A Prototype for Semantic Querying and

Evaluation of Textual Data

Christian G. Kallas

Supervisor: Dr. Joe Tekli

Byblos, April 2017

Abstract

This undergraduate research project addresses the problem of semantic-aware search in textual SQL databases. This problem has emerged as a required extension of the standard containment keyword-based query to meet user needs in textual databases and IR applications. We build on top of a semantic-aware inverted index called SemIndex, previously developed by our peers, to allow semantic-aware search, result selection, and result ranking functionality. Various weighting functions and search algorithms have been developed for that purpose. An easy-to-use graphical interface was also designed to help testers write and execute their queries. Preliminary experiments reflect the effectiveness and efficiency of our solution.

Table of Contents

Abstract
1 Introduction
  1.1 Context
  1.2 Goal and Contributions
2 Background
  2.1 Literature Overview
  2.2 SemIndex Logical Model
  2.3 SemIndex Physical Implementation
3 SemIndex Weight Functions
  3.1 Index Node Weight
  3.2 Index Edge Weight
  3.3 Data Node Weight
  3.4 Data Edge Weight
  3.5 Relevance Scoring Measure
4 Querying Algorithms
  4.1 Generic Solutions
    4.1.1 Legacy Inverted Index Search
    4.1.2 Query Relaxation
    4.1.3 Query Disambiguation
  4.2 SemIndex Querying Algorithms
    4.2.1 SemIndex Search
    4.2.2 SemIndex Disambiguated
    4.2.3 SemIndex Query As You Type
5 Prototype Functionality
  5.1 Interface Manipulation and Customization
  5.2 Range and KNN Search Operators
  5.3 Relevant Document Generator
6 Experimental Evaluation
  6.1 Experimental Metrics
    6.1.1 Quality: Precision, Recall and F-Value
    6.1.2 Ranking: Mean Average Precision
    6.1.3 Ordering: Kendall's Tau
  6.2 Preliminary Results
    6.2.1 Querying Effectiveness
    6.2.2 Querying Efficiency
7 Conclusion
  7.1 Summary of Contributions
  7.2 Ongoing Works

List of Figures

Figure 1 - SemIndex framework (from [7])
Figure 2 - Sample inverted index (a) and corresponding SemIndex graph (b), based on the textual collection Δ in Table 1
Figure 3 - Extract from the knowledge graph of WordNet, with corresponding inverted index
Figure 4 - SemIndex graph obtained after coupling the data collection and the knowledge base graphs in Figure 3
Figure 5 - SemIndex physical design
Figure 6 - Sample node linkage in SemIndex graph
Figure 7 - Querying panel
Figure 8 - Query as you go panel
Figure 9 - Query as you go meaning chooser panel
Figure 10 - Relevant Document Generator
Figure 11 - Comparing precision (PR) and recall (R) results obtained using SemIndex versus legacy InvIndex, with query group Q1 (unrelated queries), varying the number of query terms k and link distance L
Figure 12 - Comparing precision (PR) and recall (R) results obtained using SemIndex versus legacy InvIndex, with query group Q2 (expanded queries), varying the number of query terms k and link distance L
Figure 13 - Comparing f-value levels obtained using SemIndex versus legacy InvIndex, with query group Q1 (unrelated queries) and query group Q2 (expanded queries)
Figure 14 - Comparing SemIndex and legacy InvIndex query execution time, considering queries of group Q1 (unrelated queries) ...
Figure 15 - Comparing SemIndex and legacy InvIndex query execution time, considering queries of group Q1 (unrelated queries), while varying the number of query terms k

List of Tables

Table 1 - Sample movie data collection extracted from IMDB, titled Δ
Table 2 - Test queries

1 Introduction

1.1 Context

Processing keyword-based queries is a fundamental problem in the domain of Information Retrieval (IR) [12, 15, 16]. A standard containment keyword-based query, which retrieves textual entities that contain a set of keywords, is generally supported by a full-text index. The inverted index is considered one of the most useful full-text indexing techniques for very large textual collections [16]; it is supported by many relational DBMSs, and has been extended toward semi-structured and unstructured data to support keyword-based queries.

In a previous study [7], a group of colleagues developed SemIndex: a semantic-aware inverted index model designed to process semantic-aware queries. An extended query model with different levels of semantic awareness was defined, so that both semantic-aware queries and standard containment queries can be processed within the same framework. Figure 1 illustrates the overall framework of the SemIndex approach and its main components. Briefly, the Indexer manages index generation and maintenance, while the Query Processor processes and answers semantic-aware (or standard) queries issued by the user using the SemIndex component.

Figure 1 - SemIndex framework (from [7])

1.2 Goal and Contributions

While the study in [7] focused on the index design model of SemIndex, the goal of our current project is to complete the design and development of SemIndex's query evaluation engine. To this end, our contributions include: i) dedicated weight functions, associated with the different elements of SemIndex, allowing semantic query result selection and ranking; ii) a dedicated relevance scoring measure, required in the query evaluation process in order to retrieve and rank relevant query answers; iii) three alternative query processing algorithms (in addition to the main algorithm); iv) a dedicated GUI allowing the user to easily manipulate the prototype system; and v) an experimental study evaluating SemIndex's querying effectiveness and efficiency.

2 Background

2.1 Literature Overview

Early approaches to keyword search queries over RDBs use traditional IR scores (e.g., TF-IDF) to find ways to join tuples from different tables in order to answer a given keyword query [1, 14, 20]. The proposed search algorithms focus on the enumeration of join networks, called candidate networks, to connect relevant tuples by joining different relational tables. The result for a given query comes down to a sequence of candidate networks, each made of a set of tuples containing the query keywords in their text attributes and connected through their primary-foreign key references, ranked based on candidate network size and coverage. More recent methods on RDB full-text search in [24, 25] focus on more meaningful scoring functions and the generation of top-k candidate networks of tuples, allowing to group and/or expand candidate networks based on certain weighting functions in order to produce more relevant results. The authors in [27] tackle the issue of keyword search on streams of relational data, whereas the approach in [39] introduces keyword search for RDBs with star-schemas found in OLAP applications. Other approaches introduced natural language interfaces providing alternate access to an RDB using text-to-SQL transformations [22, 33], or extracting structured information (e.g., identifying entities) from text (e.g., Web documents) and storing it in a DBMS to simplify querying [9, 10]. Keyword-based search for other data models, such as XML [17, 19] and RDF [3, 4], has also been studied.

More recent approaches have developed so-called semantic-aware or knowledge-aware (keyword) query systems, which have emerged over the past decade as a natural extension to traditional containment queries, encouraged by (non-expert) user demands. Most existing works in this area have incorporated semantic knowledge at the query processing level, to: i) pre-process queries using query rewriting/relaxation and query expansion [5, 11, 28], ii) disambiguate queries using semantic disambiguation and entity recognition techniques [5, 23, 32], and/or iii) post-process query results using semantic result organization and re-ranking [32, 35, 38]. Yet, various challenges remain unsolved, namely: i) time latencies when involving query pre-processing and post-processing [11, 28], ii) the complexity of query rewriting/relaxation and query disambiguation, which require context information (e.g., user profiles or query logs) that is not always available [13, 37], and iii) limited user involvement, where the user is usually constrained to providing feedback and/or performing query refinement after the first round of results has been provided by the system [6, 31].

Our work is complementary to most existing DB search algorithms in that our approach extends syntactic keyword-term matching, where only tuples containing exact occurrences of the query keywords are identified as results, toward semantic-based keyword matching, where tuples containing terms which are lexically and semantically related to the query terms are also identified as potential results: a functionality which - to our knowledge - remains unaddressed in most existing DB search algorithms. We build on an adapted index structure able to integrate and extend textual information with domain knowledge (not only at the querying level, but rather) at the most basic data indexing level, providing a semantic-aware inverted index capable of supporting semantic-based querying, and allowing to answer most of the challenges identified above.

2.2 SemIndex Logical Model

The SemIndex logical design is a semantic knowledge graph structure obtained by combining two graph representations, one for each input resource: i) the textual collection (e.g., IMDB) and ii) the reference knowledge base (WordNet), into a single SemIndex graph structure. Consider for instance the sample data collection in Table 1.

Table 1 - Sample movie data collection extracted from IMDB, titled Δ

ID | Textual content
O1 | When a Stranger Calls (2006): A young high school student babysits for a very rich family. She begins to receive strange phone calls threatening the children...
O2 | Days of Thunder (1990): Cole Trickle is a young racer from California with years of experience in open-wheel racing winning championships in Sprint car racing...
O3 | Sound of Music, The (1965): Maria had longed to be a nun since she was a young girl, yet when she became old enough discovered that it wasn't at all what she thought...

Term | Object IDs[]
"car" | O1, O2
"light" | O1
"sound" | O3
"zen" | O1
... | ...

a. Inverted index InvIndex(Δ).
b. SemIndex graph representing InvIndex(Δ).

Figure 2 - Sample inverted index (a) and corresponding SemIndex graph (b), based on the textual collection Δ in Table 1.

Figure 2 shows an extract from an inverted index built on the sample movie database in Table 1, where data objects O1, O2, and O3 have been indexed using index terms extracted from the database, sorted in alphabetic order. It is important to note that this simple index is typically used to answer containment queries [41], aiming at finding data objects that contain one or more terms. When a keyword query mapping to two or more index terms must be processed, the corresponding posting lists are read and merged. The index terms and their mappings with the data objects can be generated using classical Natural Language Processing (NLP) techniques (including stemming, lemmatization, and stop-words removal) [30], which could be either embedded in the DBMS or supplied by a third-party provider.

An extract from the WordNet ontology is shown in Figure 3, where S1, S2, and S3 represent senses (i.e., synsets) and their string values (i.e., the synsets' glosses/definitions), and T1, T2, ..., T11 represent terms and their string values (i.e., literal words/expressions), shown alongside the nodes.

a. Sample graph representing a KB extract from WordNet.

Term | Sense IDs[]
"acid" | S1, S3
"clean" | S2
"light" | S2
"lsd" | S3
"lysergic" | S1, S3
... | ...

b. Extract of inverted index InvIndex(GKB) connecting terms in GKB with corresponding senses (to speed up term/synset lookup).

Figure 3 - Extract from the knowledge graph of WordNet, with corresponding inverted index.

A sample SemIndex graph is shown in Figure 4, built based on the textual collection Δ from Table 1 and the KB extract in Figure 3. It comprises 3 data nodes (O1 - O3), 3 index sense nodes (S1 - S3), and 11 index term nodes (T1 - T11), along with data and index edges.

a. SemIndex graph before removing edge labels and string values.
b. Final SemIndex graph representation.

Figure 4 - SemIndex graph obtained after coupling the data collection and the knowledge base graphs in Figure 3.

2.3 SemIndex Physical Implementation

Figure 5a shows the ER conceptual diagram of the SemIndex data graph, whereas Figure 5b depicts the data coverage of each relation in the resulting RDB schema. The relations are described separately in the following subsections.

2.3.1 Data Index

The DDL¹ statement for creating relation DataIndex is shown below:

CREATE TABLE DataIndex(objectid INT PRIMARY KEY, weight DECIMAL);

Relation DataIndex stores (in attribute DataIndex.objectid) the identifiers of data objects from the data collection, represented as data nodes in SemIndex (cf. Figure 5), along with data node weights (e.g., an object rank score, stored in attribute DataIndex.weight, which is computed and then updated during query processing, cf. Section 3).

2.3.2 Lexicon

The DDL statement for creating the Lexicon relation is shown below:

CREATE TABLE Lexicon (nodeid INT PRIMARY KEY, value VARCHAR, weight DECIMAL);

Relation Lexicon stores the lexicon of the knowledge base (KB) used to index the data collection (Δ), i.e., the set of all index nodes in SemIndex (cf. Figure 5). Searchable term nodes are stored in their lemmatized form (in attribute Lexicon.value) along with corresponding node identifiers (e.g., WordNet node identifiers, stored in Lexicon.nodeid), whereas sense nodes have null values (in attribute Lexicon.value). Index node weights are then computed and dynamically updated during query processing, and stored in attribute Lexicon.weight.

a. Conceptual ER model describing SemIndex physical design.
b. Data representation of each relation in the resulting RDB schema.

Figure 5 - SemIndex physical design.

¹ Data Definition Language.

2.3.3 Posting List

The DDL statement for creating the PostingList relation is shown below:

CREATE TABLE PostingList(nodeid INT, objectid INT, weight DECIMAL, PRIMARY KEY(nodeid, objectid));

Relation PostingList stores the inverted index of the textual collection (Δ), which comes down to the data edges in SemIndex. PostingList results from joining relations Lexicon and DataIndex, linking data nodes (PostingList.objectid) with corresponding searchable term nodes (PostingList.nodeid), each with its corresponding data edge weight (e.g., a term frequency score). PostingList is clustered on attribute nodeid, and for each nodeid, the posting list is sorted on objectid to optimize search time.
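For illustration, the posting-list lookup described above comes down to a join of Lexicon and PostingList. The following JDBC sketch is a minimal illustration (not the prototype's actual code); it assumes an open MySQL Connection, and the class name is hypothetical:

import java.sql.*;
import java.util.*;

// Minimal sketch: retrieve the posting list of a (lemmatized) query term by
// joining Lexicon and PostingList over nodeid, as described above.
public class PostingListLookup {
    public static List<Integer> lookup(Connection cnx, String term) throws SQLException {
        String sql = "SELECT p.objectid FROM Lexicon l "
                   + "JOIN PostingList p ON p.nodeid = l.nodeid "
                   + "WHERE l.value = ? ORDER BY p.objectid";
        try (PreparedStatement ps = cnx.prepareStatement(sql)) {
            ps.setString(1, term);
            List<Integer> objectIds = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) objectIds.add(rs.getInt(1));
            }
            return objectIds;
        }
    }
}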

2.3.4 Concepts/Terms Links

The DDL statement for creating the links between terms and concepts, in a Neighbors relation, is shown below:

CREATE TABLE Neighbors(id INT PRIMARY KEY, node1id INT, node2id INT, relationship VARCHAR, weight DECIMAL);

Relation Neighbors stores all index edges in SemIndex linking index nodes (stored as pairs of index node identifiers in attributes Neighbors.node1id and Neighbors.node2id), including: term-to-term, term-to-sense, and sense-to-sense relationships, along with index edge labels (stored in Neighbors.relationship) and corresponding index edge weights (stored in Neighbors.weight). When using WordNet, the relationship label covers 28 possible lexical/semantic relationship types (e.g., hypernym, hyponym, meronym, related-to, etc.), as well as the has-sense/has-term relationships introduced to connect WordNet term and sense nodes. An id attribute was added since several edges can exist between two index nodes.
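Similarly, expanding an index node to its semantic neighborhood amounts to probing the Neighbors relation. The sketch below is a minimal, hypothetical illustration over the schema above; it probes both edge directions, and returns a list (rather than a set) since several edges may link the same pair of nodes:

import java.sql.*;
import java.util.*;

// Minimal sketch: fetch all index edges incident to a given index node from the
// Neighbors relation, returning (neighbor id, relationship label, edge weight).
public class NeighborLookup {
    public static class Edge {
        public final int neighborId; public final String relationship; public final double weight;
        Edge(int n, String r, double w) { neighborId = n; relationship = r; weight = w; }
    }
    public static List<Edge> neighbors(Connection cnx, int nodeId) throws SQLException {
        String sql = "SELECT node2id, relationship, weight FROM Neighbors WHERE node1id = ? "
                   + "UNION ALL "
                   + "SELECT node1id, relationship, weight FROM Neighbors WHERE node2id = ?";
        try (PreparedStatement ps = cnx.prepareStatement(sql)) {
            ps.setInt(1, nodeId);
            ps.setInt(2, nodeId);
            List<Edge> edges = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) edges.add(new Edge(rs.getInt(1), rs.getString(2), rs.getDouble(3)));
            }
            return edges;
        }
    }
}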

3 SemIndex Weight Functions

3.1 Index Node Weight

The index node weight is computed directly on the index nodes, according to the formula below, where Fan-in(v_i) designates the number of incoming edges of index node v_i, regardless of each incoming node's respective importance in determining the importance of v_i:

$$W_{IndexNode}(v_i) = \frac{Fan\text{-}in(v_i)}{\max\limits_{v_j \in V_{Index}} \big(Fan\text{-}in(v_j)\big)} \in [0,1] \qquad (1)$$

The rationale here is that a node is more important if it receives more links from other nodes.

3.2 Index Edge Weight

Given an index edge e_ij incoming from index node v_i and outgoing toward index node v_j, we define:

$$W_{IndexEdge}(e_{ij}) = \frac{1}{Fan\text{-}out_{Label}(v_i)} \;\in\; ]0,1] \qquad (2)$$

The weight is thus inversely proportional to the number of outgoing links of the source index node, taking into account the semantic relation type (label) describing the index link at hand.

3.3 Data Node Weight

The data node weight is defined as follows:

$$W_{DataNode}(n_i) = \frac{Fan\text{-}In(n_i)}{\max\limits_{n_j \in G_{SemIndex}} \big(Fan\text{-}In(n_j)\big)} \in [0,1] \qquad (3)$$

where Fan-In(n_i) designates the number of foreign key/primary key data links (joins) outgoing from the data nodes (tuples) where the foreign keys reside, toward the data node (tuple) n_i where the primary key resides. The rationale followed is a PageRank-like score counting the number and quality of FK/PK links to a data node to determine a rough estimate of how important the data node is. The underlying assumption is that more important data nodes are likely to receive more FK/PK links from other data nodes.

3.4 Data Edge Weight

The data edge weight is a TF-IDF score, where TF underlines the frequency (number of occurrences) of the index node label within the data node connected via the data edge in question, and IDF underlines the number of data edges connecting the same index node to data nodes, i.e., the fan-out of the index node in question. Given a data edge e_ij incoming from index node v_i toward data node n_j, we define:

$$W_{DataEdge}(e_{ij}) = TF(v_i.l, n_j) \times IDF(v_i.l) \qquad (4)$$

where TF is calculated according to the following function:

$$TF(v_i.l, n_j) = \frac{NbOcc(v_i.l, n_j)}{\max\limits_{v_k} \big(NbOcc(v_k.l, n_j)\big)} \in [0,1] \qquad (5)$$

with NbOcc(v_i.l, n_j) the number of occurrences of label v_i.l within data node n_j. Let N be the total number of data nodes in G_SemIndex, and DF(v_i.l, G_SemIndex) the number of data nodes in G_SemIndex containing at least one occurrence of v_i.l. In order to have an IDF value in [0, 1], we redefine the formula as follows:

$$IDF(v_i.l, G_{SemIndex}) = 1 - \frac{DF(v_i.l, G_{SemIndex})}{N} \in [0,1[ \qquad (6)$$
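To make equations (4)-(6) concrete, the following in-memory sketch computes a data edge weight from hypothetical occurrence counts; it is an illustration of the formulas, not the prototype's code:

import java.util.*;

public class DataEdgeWeight {
    // TF per equation (5): occurrences of label l in data node n_j, normalized by
    // the maximum occurrences of any label within n_j.
    static double tf(Map<String, Integer> occInNode, String label) {
        int max = Collections.max(occInNode.values());
        return occInNode.getOrDefault(label, 0) / (double) max;
    }
    // IDF per equation (6): 1 - DF(l)/N, where DF(l) is the number of data nodes
    // containing label l, and N the total number of data nodes.
    static double idf(int df, int n) {
        return 1.0 - df / (double) n;
    }
    // Data edge weight per equation (4).
    static double dataEdgeWeight(Map<String, Integer> occInNode, String label, int df, int n) {
        return tf(occInNode, label) * idf(df, n);
    }
    public static void main(String[] args) {
        Map<String, Integer> occ = Map.of("car", 3, "race", 1); // hypothetical counts in one data node
        // Suppose 2 of the 3 data nodes contain "car": weight = (3/3) * (1 - 2/3) = 0.333...
        System.out.println(dataEdgeWeight(occ, "car", 2, 3));
    }
}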

3.5 Relevance Scoring Measure

The scores of data nodes returned as query answers are computed based on typical Dijkstra-style shortest distance computations. Yet, instead of identifying the shortest (smallest) distance, we identify the maximum similarity (similarity being the inverse function of distance). For a data node n_l reached from a starting index node n_i through intermediate index nodes n_j and n_k (cf. Figure 6):

$$Weight(n_i, n_l) = W_{DataNode}(n_l) \times W_{DataEdge}(e_{kl}) \times \frac{1}{d(n_k, n_l)} + W_{IndexNode}(n_k) \times W_{IndexEdge}(e_{jk}) \times \frac{1}{d(n_j, n_l)} + W_{IndexNode}(n_j) \times W_{IndexEdge}(e_{ij}) \times \frac{1}{d(n_i, n_l)} \qquad (7)$$

where d(n_i, n_j) is the distance in number of edges between two nodes. In other words, in the following example, d(n_k, n_l) = 1, d(n_j, n_l) = 2, and d(n_i, n_l) = 3.

Figure 6 - Sample node linkage in SemIndex graph

The previous example consists of a single term query (i.e., starting from a single starting index node n_i). When processing a multi-keyword query (starting from multiple starting index nodes: n_i, n_o, n_p, etc.), the data node is assigned a weight vector of the form: <weight(n_i, n_l), weight(n_o, n_l), weight(n_p, n_l)>. To get a single weight score for the data node considering all starting index nodes, we can use the average function to combine the individual weights:

$$score(n_l) = \underset{n \in \{n_i, n_o, n_p, ...\}}{average} \big(weight(n, n_l)\big) \qquad (8)$$

4 Querying Algorithms

We considered different querying algorithms in our solution: i) three generic solutions adapted from the literature (Legacy Inverted Index Search, Query Relaxation, and Query Disambiguation), and ii) three solutions we designed specifically for SemIndex (SemIndex Search, SemIndex Disambiguated, and SemIndex Query As You Type). We describe the different algorithms in the following subsections.

4.1 Generic Solutions

4.1.1 Legacy Inverted Index Search

This is a standard containment keyword-based query that:

1- Retrieves textual entities that contain a set of keywords;

2- Queries the inverted (term, objectIDs[]) list with every term in the query, to identify the object IDs associated with all (or at least one of) the query terms, based on the query type at hand (conjunctive or disjunctive);

3- Assigns a score to every potential query answer (data object) considering a predefined relevance ordering scheme, in order to return the results ranked in ascending order of their scores.

Legacy inverted index search is supported by many relational DBMSs, and has been extended to support keyword-based queries on semi-structured and unstructured data.
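As an illustration, the conjunctive case can be pushed down to the RDBMS over the relations of Section 2.3. The JDBC sketch below is a hedged example (class name hypothetical, open MySQL Connection assumed): it returns the IDs of data objects whose posting lists cover all query terms:

import java.sql.*;
import java.util.*;

// Minimal sketch of a conjunctive containment query: group posting entries by
// object and keep the objects matched by every distinct query term.
public class LegacyInvIndexSearch {
    public static List<Integer> conjunctive(Connection cnx, List<String> terms) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(terms.size(), "?"));
        String sql = "SELECT p.objectid FROM Lexicon l "
                   + "JOIN PostingList p ON p.nodeid = l.nodeid "
                   + "WHERE l.value IN (" + placeholders + ") "
                   + "GROUP BY p.objectid "
                   + "HAVING COUNT(DISTINCT l.value) = ?";
        try (PreparedStatement ps = cnx.prepareStatement(sql)) {
            int i = 1;
            for (String t : terms) ps.setString(i++, t);
            ps.setInt(i, terms.size());
            List<Integer> ids = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getInt(1));
            }
            return ids;
        }
    }
}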

4.1.2 Query Relaxation

Algorithm steps:

1- Perform Part-Of-Speech tagging;

2- Select the most common sense for each token (based on WordNet's usage frequency, computed on the Brown text corpus);

3- For each selected sense s_i, include in the query: the synonymous terms in the synset of s_i, as well as the synonymous terms of all senses included in the direct semantic context of s_i, i.e., senses that are related to s_i via a semantic relationship (e.g., hypernymy, hyponymy, meronymy, etc.);

4- Run the resulting query as a traditional keyword containment query on the data collection's inverted index (syntactic processing).

4.1.3 Query Disambiguation

Algorithm steps:

1- Perform WSD on the query terms using the Simplified (Adapted) Lesk algorithm;

2- For each of the identified senses, include in the query its synonymous terms;

3- Run the resulting query as a traditional keyword containment query on the data collection's inverted index (syntactic processing).

4.2 SemIndex Querying Algorithms

4.2.1 SemIndex Search

Algorithm steps:

1- Identify the index (searchable term) nodes mapping to each query term;

2- Identify, for each of the selected index nodes, the minimum distance paths (within the link distance threshold L) using Dijkstra's shortest path algorithm;

3- Identify the shortest paths which contain data edges linking to data nodes, and then add the resulting data nodes to the list of output data nodes;

4- Merge the resulting data nodes with the list of existing answer data nodes. Each answer node is then assigned a score by adding its distance from every query term index node. The algorithm finally returns the list of answer data nodes ranked by path score in ascending order.
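The sketch below illustrates steps 2-3 in memory, under simplifying assumptions: the SemIndex graph is given as an adjacency map with unit edge distances, and the expansion is a Dijkstra-style traversal bounded by the link distance threshold L (the actual prototype operates over the RDB relations of Section 2.3):

import java.util.*;

// Simplified sketch of SemIndex Search steps 2-3: bounded Dijkstra expansion
// from one query term's index node; data nodes reached along shortest paths are
// collected together with their distance (later turned into scores, cf. Section 3.5).
public class SemIndexSearchSketch {
    public static Map<Integer, Integer> search(Map<Integer, List<Integer>> adj,
                                               Set<Integer> dataNodes, int start, int maxL) {
        Map<Integer, Integer> dist = new HashMap<>();
        PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[1]));
        dist.put(start, 0);
        pq.add(new int[]{start, 0});
        Map<Integer, Integer> answers = new HashMap<>(); // data node -> shortest distance
        while (!pq.isEmpty()) {
            int[] cur = pq.poll();
            int node = cur[0], d = cur[1];
            if (d > dist.get(node)) continue;            // stale queue entry
            if (dataNodes.contains(node)) answers.putIfAbsent(node, d);
            for (int next : adj.getOrDefault(node, List.of())) {
                int nd = d + 1;                          // unit edge distance (one hop)
                if (nd <= maxL && nd < dist.getOrDefault(next, Integer.MAX_VALUE)) {
                    dist.put(next, nd);
                    pq.add(new int[]{next, nd});
                }
            }
        }
        return answers; // step 4 merges these per query term and ranks the answers
    }
}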

4.2.2 SemIndex Disambiguated

Algorithm steps:

1- Perform WSD on the query terms using the Simplified (Adapted) Lesk algorithm (in the SemIndex prototype, the simplified Lesk algorithm is used);

2- Identify in SemIndex the index nodes of the disambiguated senses;

3- Run the resulting query, starting from the identified index nodes, as a keyword containment query on SemIndex.

4.2.3 SemIndex Query As You Type

Algorithm steps:

1- Let the user choose the sense of each term in the query according to WordNet;

2- Identify in SemIndex the index nodes of the chosen senses;

3- Run the resulting query, starting from the identified index nodes, as a keyword containment query on SemIndex.

5 Prototype Functionality

5.1 Interface Manipulation and Customization

The interface was upgraded to handle multiple options for customized querying. Suppose we have two options, Option 1 and Option 2, each with multiple variables, say a1-b1-c1 and a2-b2-c2 respectively. If a1 and c1 were selected under Option 1 while a2 and b2 were selected under Option 2, the executed queries are all combinations of these selections:

Query 1 with a1, a2

Query 2 with a1, b2

Query 3 with c1, a2

Query 4 with c1, b2

The generated data sheets (Excel files) carry the selected options in their titles in order to differentiate between the results, e.g., NAME with C1 and A2.
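A minimal sketch of this combination expansion (a Cartesian product of the selected values per option; class and method names hypothetical) is shown below:

import java.util.*;

// Minimal sketch: expand the user's selections (one list of selected values per
// option) into the full set of query configurations, as described above.
public class OptionCombinations {
    public static List<List<String>> expand(List<List<String>> selectedPerOption) {
        List<List<String>> combos = new ArrayList<>();
        combos.add(new ArrayList<>());
        for (List<String> values : selectedPerOption) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : combos)
                for (String v : values) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.add(v);
                    next.add(extended);
                }
            combos = next;
        }
        return combos;
    }
    public static void main(String[] args) {
        // Selections from the example: {a1, c1} under Option 1, {a2, b2} under Option 2.
        System.out.println(expand(List.of(List.of("a1", "c1"), List.of("a2", "b2"))));
        // -> [[a1, a2], [a1, b2], [c1, a2], [c1, b2]]
    }
}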

The querying interface is shown in Figure 7; it exposes multiple options such as: Query Types, Link Distance (only shown for SemIndex-related query types, hidden otherwise), Query Relaxation Multiplier (shown only if Query Relaxation is chosen), Query Options, and Results. Moreover, the text field where the queries are written now displays next to each line the number of the query, indicating that each line is a separate query.

Besides the multi-querying approach, error handling (missing input, wrong option combinations, etc.) was developed, along with hiding and showing certain options depending on the current selection, in order to facilitate and highlight each option. For example, Link Distance is only shown for SemIndex-related queries, and selecting Quality Results under Results makes both the Choose Relevant Data File and Generate Relevant Data File controls appear.

The prototype was also customized for specific searching methods, mainly Query As You Type, since it requires user input (as described in Section 4.2.3); Figures 8 and 9 describe the process.

Figure 7 - Querying Panel

Figure 8 - Query as you go panel

Figure 9 - Query as you go meaning chooser panel

5.2 Range and KNN Search Operators

In the prototype shown in Figures 7 and 8, the results of a certain query are normally written to an Excel sheet. The number of results for a certain query might be, say, 185, each with a certain score.

KNN was developed and implemented to introduce a threshold on the number of returned results: e.g., to obtain the top 25 of the 185 results, KNN is set to 25.

Range, on the other hand, deals with the score, i.e., the computational combination of weights leading to a specific result. Hence a range of 0.5 returns all the results having a score equal to or larger than 0.5 and disregards the results scoring less than 0.5. Range is a strong option that gives the ability to only show very relevant results, e.g., Range 0.8.

Due to the nature of the prototype, Range and KNN can both be used at the same time, e.g., to obtain the top 25 of the 185 results having a score of at least 0.6. As shown in Figure 7, both KNN and Range are entered as simple numbers at the bottom of the Querying Panel.
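The combined filter can be sketched as follows (a minimal in-memory illustration with hypothetical scores, not the prototype's code):

import java.util.*;
import java.util.stream.Collectors;

// Minimal sketch: apply Range (keep results with score >= threshold), then KNN
// (keep the top k of what remains), over a map of result ID -> score.
public class RangeKnnFilter {
    public static List<Map.Entry<Integer, Double>> filter(Map<Integer, Double> scored,
                                                          double range, int k) {
        return scored.entrySet().stream()
                .filter(e -> e.getValue() >= range)                        // Range operator
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                .limit(k)                                                  // KNN operator
                .collect(Collectors.toList());
    }
    public static void main(String[] args) {
        Map<Integer, Double> scored = Map.of(1, 0.9, 2, 0.55, 3, 0.7, 4, 0.4); // hypothetical
        System.out.println(filter(scored, 0.6, 25)); // top 25 results with score >= 0.6
    }
}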

5.3 Relevant Document Generator

The Relevant Document Generator was created and developed in order to facilitate performing experiments. With this tool, the user can add IDs directly, or search the database using keywords and select multiple IDs at the same time. The user can then reorder or remove some IDs before generating the CSV file that will be used in computations such as Precision, Recall, F-Value, MAP, and Kendall's Tau. Figure 10 shows the customized interface of the Relevant Document Generator.

Figure 10 - Relevant Document Generator

6 Experimental Evaluation

6.1 Experimental Metrics

6.1.1 Quality: Precision, Recall and F-Value

Recall and precision are useful measures that have a fixed range and are easy to compare across queries and engines. Precision is a measure of the usefulness of a hitlist; recall is a measure of the completeness of the hitlist.

Recall is defined as:

recall = # relevant hits in hitlist / # relevant documents in the collection

Recall is a measure of how well the engine performs in finding relevant documents. Recall is 100% when every relevant document is retrieved. In theory, it is easy to achieve good recall: simply return every document in the collection for every query! Therefore, recall by itself is not a good measure of the quality of a search engine.

Precision is defined as:

precision = # relevant hits in hitlist / # hits in hitlist

Precision is a measure of how well the engine performs in not returning nonrelevant documents. Precision is 100% when every document returned to the user is relevant to the query. There is no easy way to achieve 100% precision other than in the trivial case where no document is ever returned for any query!

Both precision and recall have a fixed range: 0.0 to 1.0 (or 0% to 100%). A retrieval engine must have high recall to be admissible in most applications. The key to building better retrieval engines is to increase precision without sacrificing recall. A minimal sketch computing these metrics is given below.
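The sketch is a minimal, self-contained illustration of the three metrics above, with hypothetical document IDs (not the prototype's code):

import java.util.*;

// Minimal sketch: precision and recall of a hitlist against a set of relevant
// document IDs, plus the f-value (harmonic mean of the two).
public class QualityMetrics {
    public static double precision(List<Integer> hitlist, Set<Integer> relevant) {
        long hits = hitlist.stream().filter(relevant::contains).count();
        return hitlist.isEmpty() ? 0.0 : hits / (double) hitlist.size();
    }
    public static double recall(List<Integer> hitlist, Set<Integer> relevant) {
        long hits = hitlist.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : hits / (double) relevant.size();
    }
    public static double fValue(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }
    public static void main(String[] args) {
        List<Integer> hitlist = List.of(10, 20, 30, 40);  // hypothetical hitlist
        Set<Integer> relevant = Set.of(10, 30, 50);       // hypothetical golden truth
        double p = precision(hitlist, relevant), r = recall(hitlist, relevant);
        System.out.printf("P=%.2f R=%.2f F=%.2f%n", p, r, fValue(p, r)); // P=0.50 R=0.67 F=0.57
    }
}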

6.1.2 Ranking: Mean Average Precision

Recall and precision are measures for the entire hitlist. They do not account for the quality of the ranking of the hits in the hitlist. Users want the retrieved documents to be ranked according to their relevance to the query instead of just being returned as a set. The most relevant hits must be in the top few documents returned for a query. Relevance ranking can be measured by computing precision at different cut-off points. For example, if the top 10 documents are all relevant to the query and the next ten are all nonrelevant, we have 100% precision at a cut-off of 10 documents but 50% precision at a cut-off of 20 documents. Relevance ranking in this hitlist is very good, since all relevant documents are above all the nonrelevant ones. Sometimes the term recall at n is used informally to refer to the actual number of relevant documents up to that point in the hitlist, i.e., recall at n is the same as (n × precision at n).

Precision at n does not measure recall. A new measure called average precision, which combines precision, relevance ranking, and overall recall, is defined as follows:

Let n be the number of hits in the hitlist; let h[i] be the i-th hit in the hitlist; let rel[i] be 1 if h[i] is relevant and 0 otherwise; let R be the total number of relevant documents in the collection for the query. Then precision at hit j is:

$$precision[j] = \frac{\sum_{k=1..j} rel[k]}{j}$$

and average precision is:

$$AvgP = \frac{\sum_{j=1..n} \big(precision[j] \times rel[j]\big)}{R}$$

That is, average precision is the sum of the precision at each relevant hit in the hitlist, divided by the total number of relevant documents in the collection (Mean Average Precision, MAP, is the mean of this value over all test queries).

Average precision is an ideal measure of the quality of retrieval engines. To get an average precision of 1.0, the engine must retrieve all relevant documents (i.e., recall = 1.0) and rank them perfectly (i.e., precision at R = 1.0). Notice that bad hits do not add anything to the numerator, but they reduce the precision at all points below, and thereby reduce the contribution of each relevant document in the hitlist that is below the bad one. However, average precision is not a general measure of utility, since it does not account for any cost of bad hits that appear below all relevant hits in the hitlist. A minimal sketch of the computation is given below.
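The sketch uses hypothetical document IDs; as noted above, MAP would additionally average this value over all test queries:

import java.util.*;

// Minimal sketch of average precision: sum the precision at each relevant hit in
// the hitlist, then divide by R, the total number of relevant documents.
public class AveragePrecision {
    public static double averagePrecision(List<Integer> hitlist, Set<Integer> relevant) {
        double sum = 0.0;
        int relevantSoFar = 0;
        for (int j = 0; j < hitlist.size(); j++) {
            if (relevant.contains(hitlist.get(j))) {
                relevantSoFar++;
                sum += relevantSoFar / (double) (j + 1); // precision at this relevant hit
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }
    public static void main(String[] args) {
        // Hypothetical: 4 relevant documents in the collection, 3 of them retrieved.
        System.out.println(averagePrecision(List.of(1, 9, 2, 3), Set.of(1, 2, 3, 4)));
        // (1/1 + 2/3 + 3/4) / 4 = 0.604...
    }
}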

6.1.3 Ordering: Kendall's Tau

MAP evaluates the ranking of relevant results w.r.t. non-relevant results. Yet, it does not evaluate the ordering of relevant results w.r.t. a reference ordering. To do so, dedicated statistical correlation measures can be used, namely Kendall's tau coefficient and/or Spearman's rho coefficient. Both behave almost the same way (but are computed differently): the correlation between two variables is high when observations have a similar (or identical, for a correlation of 1) rank (i.e., the relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different, for a correlation of -1) rank between the two variables.

We will be using Kendall's tau. Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of observations of the joint random variables X and Y respectively, such that all the values of (xi) and (yi) are unique. Any pair of observations (xi, yi) and (xj, yj), where i ≠ j, are said to be concordant if the ranks for both elements agree: that is, if both xi > xj and yi > yj, or if both xi < xj and yi < yj. They are said to be discordant if xi > xj and yi < yj, or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor discordant.

The Kendall τ coefficient is defined as:

$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2} \in [-1,1]$$

where the denominator is the total number of pair combinations, so the coefficient must be in the range −1 ≤ τ ≤ 1.

- If the agreement between the two rankings is perfect (i.e., the two rankings are the same), the coefficient has value 1.
- If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other), the coefficient has value −1.
- If X and Y are independent, then we would expect the coefficient to be approximately zero.

Examples:

1. Consider the sequence of relevant results <a, b, c, d>, and consider the sequence of returned results <a, b, c, d>.

Here, τ = (6 - 0)/6 = 1: perfect agreement in ordering, with 6 concordant pairs:
(a, a) and (b, b)
(a, a) and (c, c)
(a, a) and (d, d)
(b, b) and (c, c)
(b, b) and (d, d)
(c, c) and (d, d)

2. Consider the sequence of relevant results <a, b, c, d>, and consider the sequence of returned results <b, a, c, d>.

Here, τ = (5 - 1)/6 = 2/3, with 1 discordant pair:
(a, b) and (b, a)

3. Consider the sequence of relevant results <a, b, c, d>, and consider the sequence of returned results <b, a, d, c>.

Here, τ = (4 - 2)/6 = 1/3, with 2 discordant pairs:
(a, b) and (b, a)
(d, c) and (c, d)

4. Consider the sequence of relevant results <a, b, c, d>, and consider the sequence of returned results <d, c, b, a>.

Here, τ = (0 - 6)/6 = -1, with 6 discordant pairs:
(a, d) and (b, c)
(a, d) and (c, b)
(a, d) and (d, a)
(b, c) and (c, b)
(b, c) and (d, a)
(c, b) and (d, a)

In our experimental evaluation, we plan to evaluate Kendall's tau considering the relevant results which have been retrieved, while disregarding non-relevant results and relevant ones which were not retrieved. For example:

5. Consider the sequence of relevant results <a, b, c, d>, and consider the sequence of returned results <a, b, c>. We compute Kendall's tau considering the relevant results which were retrieved: a, b, and c.

Here, τ = (3 - 0)/3 = 1, with 3 concordant pairs:
(a, a) and (b, b)
(a, a) and (c, c)
(b, b) and (c, c)

6. Consider the sequence of relevant results <a, b, c, d>, and consider the sequence of returned results <a, x, b, c>. We compute Kendall's tau considering the relevant results which were retrieved: a, b, and c, producing the same result as above.

Here, τ = (3 - 0)/3 = 1, with 3 concordant pairs:
(a, a) and (b, b)
(a, a) and (c, c)
(b, b) and (c, c)

In other words, MAP will be necessary to evaluate the ranking of retrieved relevant results w.r.t. non-relevant results, and Kendall's tau will be necessary to evaluate the ranking (relative ordering) within the retrieved relevant results themselves. We will need to compute both. A minimal sketch of the tau computation (restricted to retrieved relevant results, as planned above) is given below.

6.2 Preliminary Results

To validate our approach, we have implemented our SemIndex algorithms using Java. We also used MySQL 5.6 as an RDBMS, and WordNet 3.0 as a knowledge base. We used the IMDB movies table² as an average-scale³ input textual collection, including the attribute movie_id and the attributes (title, plot) concatenated in one column (cf. Table 1), with a total size of around 75 MBytes including more than 7 million rows. WordNet 3.0 had a total size of around 26 MBytes, including more than 117k synsets (senses). We formulated different kinds of queries, organized in three major categories: i) unrelated queries, ii) expanded queries, and iii) identical term queries, shown in Table 2.

Table 2 - Test queries

Query group Q1 - Unrelated queries:
Q1_1: "time"
Q1_2: "love", "date"
Q1_3: "fly", "power", "man"
Q1_4: "robot", "human", "war", "world"
Q1_5: "mafia", "kill", "mob", "hit", "family"

Query group Q2 - Expanded queries:
Q2_1: "car"
Q2_2: "car", "muscle"
Q2_3: "car", "muscle", "classic"
Q2_4: "car", "muscle", "classic", "speed"
Q2_5: "car", "muscle", "classic", "speed", "thrills"

Query group Q3 - Identical terms:
Q3_1: "time"
Q3_2: "time", "time"
Q3_3: "time", "time", "time"
Q3_4: "time", "time", "time", "time"
Q3_5: "time", "time", "time", "time", "time"

² Internet Movie DataBase raw files are available online from http://www.imdb.com/. We used a dedicated data extraction tool (at http://imdbpy.sourceforge.net/) to transform the IMDB files into an RDB.
³ Tests using large-scale TREC data collections and the Yago ontology as a reference KB are underway within a dedicated comparative evaluation study.

The first category consists of queries with varying numbers of selection terms (keywords), e.g., from 1 (single term query) to 5, where all terms are different and all queries are unrelated (i.e., queries with no common selection terms, cf. sample query group Q1 in Table 2). The second category consists of queries with varying numbers of selection terms, where terms are different yet queries are related, such that each query expands its predecessor by adding an additional selection term to the latter (cf. sample query group Q2 in Table 2). The third category consists of queries with varying numbers of selection terms which are identical (cf. sample query group Q3 in Table 2, made of term "time"). We used queries from this category to double check and verify the time complexity of the SemIndex querying process: testing the system's performance regardless of the nature of the particular terms used⁴.

We considered 5 groups of queries (made of 5 queries each) within each category (e.g., Q1 is one of the 5 groups of queries considered within the category of unrelated queries). Each query was tested on every one of the 100 combinations of SemIndex generated by combining the different chunks of the IMDB movies table (10%, 20%, 30%, ..., 100%) with every chunk of WordNet (10%, 20%, 30%, ..., 100%), at link distance threshold values varying from L = 1 to 5. All queries were processed 5 times each, retaining the average processing time. Hence, all in all, we ran an overall of: 3 (categories) × 5 (groups) × 5 (queries) × 100 (SemIndex chunks) × 5 (L values) × 5 (runs) = 187,500 query execution tasks. Hereunder, we present and discuss the results obtained with three sample query groups Q1, Q2, and Q3, corresponding to each category as shown in Table 2, compiled in Figures 11 through 13 (the remaining query groups show similar behavior, cf. the technical report in [36], and were thus omitted here for ease of presentation).

6.2.1 Querying Effectiveness

In addition to evaluating SemIndex's efficiency (processing time), we also evaluated its effectiveness (result quality), i.e., the interestingness of semantic-aware answers from the user's perspective. To do so, we collected the results of our test queries obtained with and without semantic indexing, i.e., querying using the legacy InvIndex (which is equivalent to executing queries as standard containment queries (L = 1) in SemIndex), and performing semantic-aware queries (L = 2 to 5) with SemIndex. Results were mapped against user feedback (user judgments, utilized as golden truth) evaluating the quality of the matches produced by the system, by computing the precision and recall metrics commonly utilized in IR evaluation [29]. Precision (PR) identifies the number of correctly returned results w.r.t. the total number of results (correct and false) returned by the system. Recall (R) underlines the number of correctly returned results w.r.t. the total number of correct results, including those not returned by the system. Since one approach may improve precision while another improves recall, it is widespread practice to also consider the f-value, which represents the harmonic mean of precision and recall, such that high precision and recall, and thus a high f-value, characterize good retrieval quality [29]. Ten test subjects (six master students and four doctoral students, who were not part of the system development team) were involved in the experiment as human judges. Testers were asked to evaluate the quality of the top 1000 results (movie objects returned) per query (since manually evaluating the tens of thousands of obtained results is practically infeasible),

⁴ When processing queries made of different terms, the terms will be mapped to different index nodes in the SemIndex graph, each index node highlighting different connectivity properties and graph neighborhoods, which entail different shortest paths when navigating the graph to perform semantic-aware querying. The span and length of these shortest paths might affect processing time. Thus, we chose to utilize same-term queries (i.e., query group Q3) to evaluate querying time considering the querying process itself, regardless of the particular properties of the shortest paths and neighborhoods of each different term.

obtained with L = 6 (as an upper bound of link distance). Tester ratings ({-1, 0, 1}, i.e., {not relevant, neutral, relevant}) were acquired for each query answer. Then, we quantified inter-tester agreement by computing pair-wise correlation scores⁵ among testers for each of the rated query answers, and subsequently selected the top 500 answers per query having the highest average inter-tester correlation scores⁶, which we utilized as the experiment's golden truth.

Results for queries of groups Q1 (unrelated queries) and Q2 (expanded queries) are shown in Figure 11 and Figure 12 respectively, whereas overall f-value results are provided in Figure 13. The results highlight several observations.

1) Precision and link distance: One can realize that the precision levels computed with both query groups Q1 (unrelated queries, Figure 11a) and Q2 (expanded queries, Figure 12a) generally increase with link distance (L), until reaching L = 3 (with Q2) or L = 4 (with Q1), where precision starts to slightly decrease toward L = 5. On one hand, this shows that the number of correct (i.e., user expected) results increases as more semantically related terms are covered in the querying process (with L > 1). On the other hand, this also shows that over-navigating the SemIndex graph to link terms with semantically related ones located as far as L > 3 hops away might include results which: i) are somehow semantically related to the original query terms, but which ii) are not necessarily interesting for the users. For instance, term "congo" (meaning: black tea grown in China) is linked to term "time" through L = 5 hops in SemIndex ("time" >> "snap" >> "reception" >> "tea" >> "congo"). Yet, results (movie objects) containing term "congo" (e.g., movies about the country Congo, or its continent Africa) were not judged to be relevant by human testers when applying query "time" (testers were probably expecting movies about the passage of time or time travel instead, etc.).

2) Precision and number of query terms: Here, one can realize that precision levels with queries of group Q1 (unrelated queries, Figure 11a) do not seem to largely vary w.r.t. the number of terms (k) per query, whereas precision levels with queries of group Q2 (expanded queries, Figure 12a) clearly increase with k. These seemingly different results are due to the human testers' expectations, where testers were required to judge the quality of each query's results given the user's supposed intent. On one hand, given that queries in group Q1 are unrelated, result quality was evaluated separately for each query, based on the query's own keyword terms (e.g., the intent of query Q1_1 is identifying movies that have to do with time, and the intent of Q1_3 is movies that have to do with a flying power man, etc.). On the other hand, given that queries in group Q2 are expanded versions of one another, result quality was evaluated based on the user's intent, which would be naturally expressed with the most expanded (i.e., most expressive) query: Q2_5. One can realize that using fewer query terms here produces lower precision levels, which is due to the system returning more results which are (semantically related to the query terms but) not necessarily related to the user's intent (e.g., query "car" might return movies that have to do with trains or taxi cabs, whereas the user is apparently searching for movies that have to do with muscle cars with speed driving and thrills, cf. Q2_5). In other words, with query group Q2: the lesser the number of query terms used, the lesser the query's expressiveness w.r.t. the user's intent, and thus the larger the number of returned results which are not necessarily related to the user's intent, producing lower precision.

⁵ Using the Pearson Correlation Coefficient (PCC), producing scores in [-1, 1] such that: -1 designates that one tester's ratings are a decreasing function of the other tester's ratings (i.e., answers deemed relevant by one tester are deemed irrelevant by the other, and vice versa), 1 designates that one tester's ratings are an increasing function of the other tester's ratings (i.e., answers are deemed relevant/irrelevant by testers alike), and 0 means that tester ratings are not correlated.
⁶ Having an average inter-tester PCC score ≥ 0.4.

Page 26: A Prototype for Semantic Querying and Evaluation of ...text-to-SQL transformations [22, 33], or extracting structured information (e.g., identifying entities) from text (e.g., Web

COE594 Undergraduate Research Project – Final Report

26

user’s intent, and thus the larger the number of returned results which are not necessarily related to the user’s intent: producing lower precision.

a. Precision results

b. Recall results

Figure 9 Comparing precision (PR) and recall (R) results obtained using SemIndex versus legacy InvIndex, with query group Q1 (unrelated queries), varying the number of query terms k and link distance L

a. Precision results

b. Recall results

Figure 10 Comparing precision (PR) and recall (R) results obtained using SemIndex versus legacy InvIndex, with query group Q2 (expanded queries), varying the number of query terms k and link distance L

a. F-value results with query group Q1 b. F-value results with query group Q2

Figure 11 Comparing f-value levels obtained using SemIndex versus legacy InvIndex, with query group Q1 (unrelated queries) and query group Q2 (expanded queries)

3) Recall and link distance: As for recall, one can see that levels obtained with both Q1 and Q2 steadily increase with link distance (L), varying from L = 1 (legacy InvIndex) to L = 5 (Figure 9.b and Figure 10.b).


This maps to observation 1, where the number of correct (i.e., user-expected) results returned by the system increases as more semantically related terms are covered in the querying process. In other words, the more correct results the system returns, the fewer correct results it misses, and thus the higher the recall levels. Note that returning noisy (incorrect) results along with the correct ones does not affect recall (it rather affects precision, as explained in observation 1).

4) Recall and number of query terms: Recall levels vary in a similar fashion to precision levels when varying link distance (L): they increase with L, which accounts for more semantic coverage (returning more semantically related results) in the SemIndex graph (Figure 9.b and Figure 10.b). However, recall levels tend to decrease (rather than increase) with k. This is due to the fact that shorter (less expressive) queries (i.e., with smaller k values) naturally return more (semantically related) results than larger queries made of multiple terms (larger k), which necessarily identify fewer results (e.g., fewer movie objects matching all the query's terms). Hence, a decrease in the number of returned results (with increasing k values) means that the number of correct (and incorrect) results naturally decreases, which leads to a decrease in recall.

5) F-value: As for f-value results, levels clearly and significantly increase with the increase of link distance L, whereas they slightly decrease with the increase of the number of query keywords k. This naturally confirms the precision and recall levels obtained above, where the determining factor affecting retrieval quality remains link distance, whereas an increase in the number of keywords k tends to reduce system recall with higher values of k (queries becoming very selective, and thus missing some relevant results). Note that f-value levels are consistently and significantly higher than those obtained with the legacy InvIndex, highlighting a substantial improvement of semantic-aware retrieval quality over syntactic retrieval quality.
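The quality metrics behind these observations can be computed per query from the golden truth in the standard IR way. The sketch below (our own illustration; the movie ids are made up) shows how adding semantically related answers at larger L raises recall, while noisy additions lower precision:

    # Precision, recall, and f-value of one query's result set against the
    # golden truth (standard IR definitions).
    def precision_recall_f(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        tp = len(retrieved & relevant)                 # correct results returned
        p = tp / len(retrieved) if retrieved else 0.0
        r = tp / len(relevant) if relevant else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    relevant = {"m1", "m2", "m3", "m4"}                # golden-truth answers
    # L = 1: fewer answers, high precision, low recall -> (1.0, 0.5, 0.667)
    print(precision_recall_f({"m1", "m2"}, relevant))
    # L = 5: all correct answers found plus one noisy one -> (0.8, 1.0, 0.889)
    print(precision_recall_f({"m1", "m2", "m3", "m4", "m9"}, relevant))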

6.2.2 Querying Efficiency

We ran the same querying tasks through the legacy InvIndex built on top of IMDB, and compared the obtained query time results with those of SemIndex. Figure 12 and Figure 13 provide results obtained when running queries of group Q1 (unrelated), plotted by varying the number of query terms k (Figure 12) and the SemIndex link distance threshold L (Figure 13). Similar results were obtained with queries of groups Q2 (expanded) and Q3 (identical terms) and have been omitted here for clarity of presentation (they are provided in [36]).
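A minimal sketch of this timing protocol is given below (our own illustration; engine.run_query is an assumed interface, not the prototype's actual API). In our setting, such a loop would be repeated per (k, L) configuration for SemIndex, and per k for the InvIndex baseline:

    # Wall-clock timing of the same query workload against a given index.
    import time

    def time_queries(engine, queries, link_distance=1, runs=3):
        timings = {}
        for q in queries:
            start = time.perf_counter()
            for _ in range(runs):                  # average over several runs
                engine.run_query(q, link_distance)
            timings[q] = (time.perf_counter() - start) / runs
        return timings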


a. SemIndex query execution time

b. InvIndex versus SemIndex at L = 1 c. InvIndex versus SemIndex at L = 2

d. InvIndex versus SemIndex at L = 3

e. InvIndex versus SemIndex at L = 4

f. InvIndex versus SemIndex at L = 5

Figure 12 Comparing SemIndex and legacy InvIndex query execution time, considering queries of group Q1 (unrelated queries), varying the number of query terms k

a. SemIndex query execution time

b. InvIndex versus SemIndex at k = 1 c. InvIndex versus SemIndex at k = 2

d. InvIndex versus SemIndex at k = 3

e. InvIndex versus SemIndex at k = 4

f. InvIndex versus SemIndex at k = 5

Figure 13 Comparing SemIndex and legacy InvIndex query execution time, considering queries of group Q1 (unrelated queries), varying link distance L



On one hand, results in Figure 12 and Figure 13 show that SemIndex and InvIndex have very close query time levels when link distance is small (L = 1 and L = 2), and that SemIndex time increases as link distance increases, reaching its highest levels with L = 5 (i.e., almost 8 times higher than InvIndex time levels, with SemIndex reaching 5 hops into the semantic graph structure to identify more semantically related results). On the other hand, Figure 12 shows that both SemIndex and InvIndex query time levels slightly increase when increasing the number of query keywords k with small link distances (L = 1 and L = 2), and that the pace of increase tends to grow with k when reaching higher link distance thresholds (L = 3 to 5). In other words, the time to navigate the semantic graph, following the allowed link distance L, remains the foremost determining factor in query execution time. Also, results in Figure 13 show that InvIndex query time is invariant w.r.t. variations in link distance L (since it does not navigate the semantic graph, and thus does not perform semantic-aware processing).
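This cost behavior can be seen in a minimal breadth-first expansion sketch (our own illustration, not the prototype's algorithm): each extra allowed hop typically enlarges the frontier of semantically related terms, and hence the work done per query.

    # Bounded expansion over a semantic graph given as an adjacency map
    # (assumed representation: {term: [semantically related terms]}).
    from collections import deque

    def expand_terms(graph, seed_terms, max_hops):
        seen = set(seed_terms)
        frontier = deque((t, 0) for t in seed_terms)
        while frontier:
            term, hops = frontier.popleft()
            if hops == max_hops:
                continue                           # do not expand past L hops
            for neighbor in graph.get(term, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, hops + 1))
        return seen                                # query terms + terms within L links

With L = 5, for instance, "time" reaches "congo" through the "snap" >> "reception" >> "tea" chain noted earlier, which accounts for both the recall gains and the extra query time observed at high L.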

7 Conclusion

7.1 Summary of Contributions

The main goal of our research project was to complete the design and development of SemIndex's query evaluation engine. To this end, we designed and implemented various components and functionality, including: i) dedicated weight functions, associated with the different elements of SemIndex, allowing semantic query result selection and ranking, coupled with ii) a dedicated relevance scoring measure, required in the query evaluation process in order to retrieve and rank relevant query answers, iii) various alternative query processing algorithms (in addition to the main algorithm), as well as iv) a dedicated GUI allowing users to easily manipulate the prototype system. Preliminary experimental results highlight our solution's querying effectiveness and efficiency.

7.2 Ongoing Work

We are currently completing an extended experimental study to evaluate SemIndex's properties in terms of: i) genericity: supporting different types of textual (structured, semi-structured, NoSQL) data collections, and different semantic knowledge sources (general purpose, like Roget's thesaurus [40], Yago [18], and Google [21], as well as domain specific, like ODP [26] for describing semantic relations between Web pages, FOAF [2] to identify relations between persons in social networks, and SSG [34] to describe visual and semantic relations between vector graphics); ii) effectiveness: evaluating the interestingness of semantic-aware query answers considering different query answer weighting and ranking (result ordering) schemes, in comparison with IR-based indexing, query expansion, and semantic disambiguation methods; and iii) efficiency: reducing the index's building and query processing costs, using multithreading and various index fragmentation and sub-graph mining techniques [8].


References

[1] Agrawal S., Chaudhuri S. and Das G., DBXplorer: enabling keyword search over relational databases. Proceedings of the International Conference on Management of Data (SIGMOD'02), 2002. p. 627.

[2] Aleman-Meza B., Nagarajan M., Ding L., et al., Scalable Semantic Analytics on Social Networks for Addressing the Problem of Conflict of Interest Detection. ACM Transactions on the Web (TWeb), 2008. 2(1):7.

[3] Bednar P., Sarnovsky M. and Demko V., RDF vs. NoSQL databases for the Semantic Web applications. IEEE 12th International Symposium on Applied Machine Intelligence and Informatics (SAMI'14), 2014. pp. 361-364.

[4] Blanco R., Mika P. and Vigna S., Effective and Efficient Entity Search in RDF data. In International Semantic Web Conference (ISWC'11), 2011. pp. 83–97.

[5] Burton-Jones A., Storey V.C., Sugumaran V. and Purao S., A Heuristic-Based Methodology for Semantic Augmentation of User Queries on the Web. In Proceedings of the International Conference on Conceptual Modeling (ER'03), 2003. pp. 476–489.

[6] Chandramouli K., Kliegr T., Nemrava J., et al., Query Refinement and User Relevance Feedback for Contextualized Image Retrieval. 5th International Conference on Visual Information Engineering (VIE), 2008. pp. 453-458.

[7] Chbeir R., Luo Y., Tekli J., et al., SemIndex: Semantic-Aware Inverted Index. 18th East-European Conference on Advanced Databases and Information Systems (ADBIS'14), 2014. pp. 290-307.

[8] Cheng J., Ke Y., Fu A.W.-C., et al., Fast graph query processing with a low-cost index. VLDB Journal, 2011. 20(4):521-539.

[9] Cheng T., Yan X. and Chang K. C., EntityRank: searching entities directly and holistically. Proceedings of the 33rd international conference on Very Large Data Bases (VLDB'07), 2007. pp. 387-398.

[10] Chu E., Baid A., Chen T., et al., A relational approach to incrementally extracting and querying structure in unstructured data. Proceedings of the 33rd international conference on Very Large Data Bases (VLDB'07), 2007. pp. 1045-1056.

[11] Cimiano P., Handschuh S. and Staab S., Towards the Self-Annotating Web. In Proceedings of the International World Wide Web Conference (WWW'04), 2004. pp. 462-471.

[12] Das S., et al., Making Unstructured Data SPARQL Using Semantic Indexing in Oracle Database. In Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE), 2012. pp. 1405–1416.

[13] de Lima E.F. and Pedersen J.O., Phrase Recognition and Expansion for Short, Precision-Biased Queries Based on a Query Log. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999. pp. 145-152, Berkeley, CA.

[14] Ding B., Xu Yu J., Wang S., et al., Finding top-k min-cost connected trees in databases. Proceedings of the International Conference on Data Engineering (ICDE'07), 2007.

[15] Florescu D., Kossmann D. and Manolescu I., Integrating Keyword Search into XML Query Processing. Computer Networks, 2000. 33(1-6):119–135.

[16] Frakes W.B. and R.A. Baeza-Yates, Information retrieval: Data structures and algorithms. Prentice-Hall, 1992.

[17] Guo L., Shao F., Botev C., et al., XRANK: ranked keyword search over XML documents. Proceedings of the International Conference on Management of Data (SIGMOD), 2003. pp. 16-27.

[18] Hoffart J., Suchanek F.M., Berberich K., et al., YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 2013. 194: 28-61.


[19] Hristidis V., Koudas N., Papakonstantinou Y., et al., Keyword proximity search in XML trees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2006. 18(4):525–539.

[20] Hristidis V. and Papakonstantinou Y., DISCOVER: Keyword search in relational databases. Proceedings of the International Conference on Very Large Databases (VLDB), 2002.

[21] Klapaftis I. and Manandhar S., Evaluating Word Sense Induction and Disambiguation Methods. Language Resources and Evaluation, 2013. 47(3):579-605.

[22] Li F. and Jagadish H.V., Constructing an Interactive Natural Language Interface for Relational Databases. Proceedings of the VLDB Endowment, 2014. pp. 73-84.

[23] Li Y., Yang H. and Jagadish H.V., Term Disambiguation in Natural Language Query for XML. In Proceedings of the International Conference on Flexible Query Answering Systems (FQAS), 2006. LNAI 4027, pp. 133–146.

[24] Liu F., Yu C., Meng W., et al., Effective keyword search in relational databases. Proceedings of the 2006 ACM SIGMOD international conference on Management of data, 2006. pp. 563-574.

[25] Luo Y., Lin X., Wang W., et al., Spark: top-k keyword query in relational databases. Proceedings of the 2007 ACM International Conference on Management of Data (SIGMOD-07), 2007. pp. 115-126.

[26] Maguitman A., Menczer F., Roinestad H., et al., Algorithmic Detection of Semantic Similarity. Proceedings of the International Conference on the World Wide Web (WWW), 2005. pp. 107-116.

[27] Markowetz A., Yang Y. and Papadias D., Keyword search on relational data streams. Proceedings of the International Conference on Management of Data (SIGMOD'07), 2007. pp. 605–616.

[28] Martinenghi D. and Torlone R., Taxonomy-based relaxation of query answering in relational databases. VLDB Journal, 2014. 23(5):747-769.

[29] Salton G. and McGill M.J., Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[30] Miller S., Bobrow R., Ingria R., et al., Hidden understanding models of natural language. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 1994. pp. 25–32.

[31] Mishra C. and Koudas N., Interactive Query Refinement. International Conference on Extending Database Technology (EDBT'09), 2009. pp. 862-873.

[32] Navigli R. and Crisafulli G., Inducing Word Senses to Improve Web Search Result Clustering. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010. pp. 116–126, MIT, USA.

[33] Nihalani N., Silakari S. and Motwani M., Natural Language Interface for Database: A Brief Review. International Journal of Computer Science Issues, 2011. 8(2):600-608.

[34] Salameh K., Tekli J. and Chbeir R., SVG-to-RDF Image Semantization. 7th International SISAP Conference, 2014. pp. 214-228.

[35] Nguyen S.H., Świeboda W. and Jaśkiewicz G., Semantic Evaluation of Search Result Clustering Methods. Intelligent Tools for Building a Scientific Information Platform, Studies in Computational Intelligence, 2013. 467:393-414.

[36] Tekli J., Chbeir R., Luo Y., et al., SemIndex: Semantic-Aware Inverted Index - Technical Report. Available at http://sigappfr.acm.org/Projects/SemIndex/, 2015.

[37] Voorhees E.M., Query Expansion Using Lexical-Semantic Relations. In Proceedings of the 17th International ACM/SIGIR Conference on Research and Development in Information Retrieval, 1994. pp. 61-69.

[38] Wen H., Huang G.S. and L. Z., Clustering Web Search Results Using Semantic Information. International Conference on Machine Learning and Cybernetics, 2009. Vol. 3, pp. 1504-1509.


[39] Wu P., Sismanis Y. and Reinwald B., Towards keyword-driven analytical processing. Proceedings of the International Conference on Management of Data (SIGMOD'07), 2007. pp. 617–628.

[40] Yarowsky D., Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proceedings of the International Conference on Computational Linguistics (COLING), 1992. Vol. 2, pp. 454-460, Nantes.

[41] Zhang C., Naughton J., DeWitt D., et al., On supporting containment queries in relational database management systems. SIGMOD Record, 2001. 30(2):425–436.