Improving QA performance through semantic reformulation

Muthukrsihanan Ramprasath and Shanmugasundaram Hariharan

2012 NIRMA UNIVERSITY INTERNATIONAL CONFERENCE ON ENGINEERING, NUiCONE-2012, 06-08 DECEMBER, 2012

978-1-4673-1719-1/12/$31.00 ©2013 IEEE

Abstract-- A crushing amount of textual information is available in electronic form on the Internet. As a result, finding the answer to a user query is essential in natural language processing, information retrieval, and question answering. Semantic-based question reformulation is frequently used in question answering systems to retrieve answers from large document collections. The goal of this paper is to find useful, standard reformulation patterns that can be used in our question answering (QA) system to find exact candidate answers. We use the TREC-8, TREC-9, and TREC-10 collections as the training set, from which different types of questions and their corresponding answers are drawn. The QA system automatically extracts patterns from sentences retrieved from a search engine. With the help of WordNet, it checks the syntactic tags and the semantic relation between each question and answer pair. A weight is then assigned to each extracted pattern according to its length, the distance between keywords, and the level of semantic similarity between the extracted question and answer. The proposed system differs from most former reformulation learning systems.

Index Terms—Question answering system, question reformulation, natural language processing.

I. INTRODUCTION

The main purpose of semantic-based reformulation is to support human-computer interaction (HCI) and answer extraction. Recent improvements in question answering have made it possible for a user to ask a question in natural language (e.g., "Who is the president of Stanford University?") and receive the specific answer ("Kennedy") rather than sifting through irrelevant documents returned by a search engine. Today, question reformulation plays a vital role in a QA system by identifying the possible forms in which the answer to a natural language question may be expressed. A QA system uses reformulation to retrieve the answer from a huge document collection. For example, given the question "Who is the president of Stanford University?", a reformulation-based QA system will search the document collection for formulations such as "<NP>, the president of Stanford University" or "the Stanford University president, <NP>", and will instantiate <NP> with the matching noun phrase.

Barzilay and McKeown [15] discussed different methods, such as manual collection and corpus-based extraction, for collecting paraphrases. The Web is used as a linguistic resource for finding semantically equivalent patterns for a natural language question. A large amount of work on QA systems has been concerned with question reformulation, that is, finding semantically based reformulation patterns for a natural language question; nevertheless, the problem has not received much attention. In this paper we present a semantic-based question reformulation technique to improve the performance of the system. A perfect reformulation helps to identify the correct answer and keeps wrong answers out of the QA system. The QA system first parses the given question and then identifies its answer type. The question reformulation module uses the parsed version of the reformulation pattern to extract the answer from the sentences returned by the search engine. In a multilingual setting, writing reformulations by hand is a tedious task that must be repeated for each type of question, which is why many researchers have attempted to acquire reformulations automatically.

The rest of the paper is organized as follows: Section II presents related work on reformulation-based QA systems. Section III discusses the proposed architecture of the semantic-based reformulation learning system. Section IV evaluates the system, and Section V discusses conclusions and future work.

II. RELATED WORK

Information retrieval (IR) systems help to find the documents that satisfy a user's information requirements in a huge document collection. The major task of an IR system is to retrieve the documents most relevant to a user query posed in natural language. Nevertheless, as Harabagiu et al. [1] explain, IR is better described as document retrieval. In 1998, the National Institute of Standards and Technology (NIST) added a question answering (QA) track as a new task to their competition-style Text REtrieval Conference (TREC) [2]. Mandreoli et al. [16] discussed a query reformulation framework for P2P networks. For instance, given the question "Who is the president of Stanford University?", a QA system will search a 3 GB document collection and return the exact answer, "Kennedy", instead of returning entire documents. Table I shows a list of factoid questions from the TREC-QA track. Mean reciprocal rank (MRR) has been used as the standard measure to evaluate QA systems on the TREC collections. A QA system returns a ranked list of candidate answers for each question posed by the user. The reciprocal rank (RR) is used to compute the score for a question: if the answer is present in the candidate list, the score is the reciprocal of its rank; otherwise the score is zero. Ravichandran and Hovy [3] used a machine learning technique in a QA system to automatically learn question-answer patterns along with a confidence score.
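The reciprocal-rank scoring just described can be sketched as follows (a minimal illustration in Python, not the authors' Perl implementation; all function and variable names here are our own):

```python
def reciprocal_rank(candidates, gold):
    """Score one question: 1/rank of the first correct candidate, else 0."""
    for rank, answer in enumerate(candidates, start=1):
        if answer == gold:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average the reciprocal ranks over (candidate-list, gold-answer) pairs."""
    return sum(reciprocal_rank(c, g) for c, g in runs) / len(runs)

# Example: the correct answer "Kennedy" is ranked 2nd for one question
# and 1st for another, so MRR = (0.5 + 1.0) / 2 = 0.75.
runs = [(["Hennessy", "Kennedy", "Casper"], "Kennedy"),
        (["Kennedy", "Sterling"], "Kennedy")]
print(mean_reciprocal_rank(runs))  # 0.75
```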

TABLE I
FACTOID QUESTIONS FROM THE TREC-QA TRACK (TREC 8-11)

Kwok et al. [4] performed syntactic modifications on question and answer pairs with the help of a transformational method. Soubbotin et al. [5] used reformulation patterns as the core of their QA system. [18-20] Our QA system uses a training corpus of 1343 question-answer pairs taken from the TREC-8, TREC-9, and TREC-10 collections.

A QA system can use the Web as a linguistic resource for reformulation. The technique is based on identifying the various ways of expressing the answer context for a given natural language question. Soubbotin et al. [5] used reformulation patterns as the nucleus of their QA system, manually generating patterns for each question in the TREC-10 QA track. Brill et al. [7] generated patterns automatically with the help of simple word permutations: by permuting the words of the question, they produced a large set of reformulations.

Kwok et al. [8] performed syntactic modifications on questions using transformational grammar, such as Subject-Aux and Subject-Verb movement; this transformational-grammar approach is also discussed in [17]. Radev et al. [9] learned the most effective query reformulations for their QA system. Molla's QA system [10] translates question and answer sentences into a graph-based logical form representation. Stevenson et al. [11] used a vector space model to learn answer patterns and rank candidate answers; each answer pattern consists of a predicate-argument structure that is mapped to the subject, verb, and object (SVO) of a clause.

III. SEMANTIC BASED REFORMULATION LEARNING SYSTEM

The Web is used as a linguistic resource to learn reformulation patterns for a given question. Barzilay and McKeown [12] distinguish between three different methods: manual collection, corpus-based extraction, and the use of linguistic resources. Of these, manual collection of paraphrases for a given question and answer pair is surely the easiest to execute. Semantic networks and WordNet (a lexical database) also prove useful for gathering paraphrases for a given question.

Fig. 1. Semantic-based reformulation learning architecture.

Riloff and Jones [13] discussed an information extraction approach that can be adapted to solve the problem of reformulation learning.

The user's question and answer pair is analyzed to extract the arguments and the semantic relation holding between the question and the answer. A query is then formulated from the arguments extracted from the question and answer pair. Subsequently, the Web search engine runs the formulated query and returns the most relevant documents. The retrieved documents are then filtered to keep only the sentences that contain the semantic relation. These sentences are passed to NLP tools, such as a part-of-speech tagger, named entity recognizer, and noun phrase chunker, and are generalized into answer patterns using syntactic and semantic tags. Finally, based on the semantic distance between question and pattern and on the frequency of the pattern, a confidence weight is assigned to each generated pattern.
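The filtering step of this pipeline can be sketched as follows (our own simplified code, not the authors' implementation): sentences retrieved from the search engine are kept only if they mention both arguments of the question-answer pair.

```python
def filter_sentences(sentences, arguments):
    """Keep only sentences that mention every argument of the Q-A pair."""
    keep = []
    for sentence in sentences:
        lowered = sentence.lower()
        if all(arg.lower() in lowered for arg in arguments):
            keep.append(sentence)
    return keep

sentences = [
    "Kennedy is the president of Stanford University.",
    "Stanford University is located in California.",
    "The president of Stanford University, Kennedy, spoke today.",
]
print(filter_sentences(sentences, ["Kennedy", "Stanford University"]))
# Keeps the first and third sentences only.
```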

1) Question and answer patterns

When the user gives a question to the system, it needs to identify the answer pattern the question is looking for. A question pattern such as "who vb person" is an example of a pattern that matches "Who is the president of Stanford University?". An answer pattern specifies the form of a sentence that may hold a possible candidate answer: once the question pattern is matched to the input question, a set of answer patterns is searched for in the document collection. For instance, for "Who is the president of Stanford University?", the QA system tries to discover sentences that match any one of these answer patterns:

<QS> <VB> <ANS>
<ANS> <VB> by <QS>

where

(ANS) candidate answer, e.g., "Kennedy";

TREC #  | Question # | Question
TREC 8  | Q 11       | Who was President Cleveland's wife?
TREC 8  | Q 12       | How much did Manchester United spend on players in 1993?
TREC 9  | Q 211      | Where did bocci originate?
TREC 9  | Q 212      | Who invented the electric guitar?
TREC 10 | Q 901      | What is Australia's national flower?
TREC 10 | Q 902      | Why does the moon turn orange?
TREC 11 | Q 1401     | What is the democratic party symbol?
TREC 11 | Q 1402     | What year did Wilt Chamberlain score 100 points?


(QS) question term, e.g., "president of Stanford University"; (VB) a verb in its simple form.
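The answer patterns above can be matched with a simple regular expression, sketched below (our own illustrative code; the tiny verb set and the capitalized-word heuristic for the answer slot are simplifying assumptions):

```python
import re

# One answer pattern: <QS> <VB> <ANS> -- the question term, a simple verb,
# then a capitalized word that instantiates the candidate answer.
QS = re.escape("the president of Stanford University")
VB = r"(?:is|was|named|called)"  # tiny assumed verb set
pattern = re.compile(rf"{QS}\s+{VB}\s+(?P<ANS>[A-Z][a-z]+)")

sentence = "In 1980 the president of Stanford University was Kennedy."
match = pattern.search(sentence)
if match:
    print(match.group("ANS"))  # Kennedy
```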

2) Semantic based reformulation system

The goal of the reformulation system is to analyze how people naturally form queries to discover the answer to a given question. Our reformulation study used 200 questions from TREC-8, 693 questions from TREC-9, and 500 questions from TREC-10, with questions selected at random from the TREC-9 collection. For each question we produced the simplest queries that yield the most relevant Web pages containing the answer. For instance, some of the questions and corresponding Web queries are given below:
1. Where is Belize located? -> "Belize is located in"
2. How many continents are there? -> "the continent" AND "quantity"
3. When was the slinky invented? -> "the slinky invented on" OR "the slinky invented in"

The semantic-based reformulation process is composed of an acquisition stage and a validation stage. The acquisition stage is capable of digging the Web for linguistic information; the following example shows the working principle. In this reformulation system it is possible to avoid the keyword extraction phase entirely and instead use very general information extraction patterns derived directly from the arguments being processed.
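A naive version of this query-generation step can be sketched as follows (our own simplified code; the two rewrite rules are illustrative assumptions, not the authors' full rule set):

```python
def reformulate(question):
    """Turn a wh-question into declarative search-engine queries.

    Illustrative rules only: 'Where is X located?' -> '"X is located in"',
    'When was X invented?'  -> '"X was invented in/on"'.
    """
    words = question.rstrip("?").strip().split()
    queries = []
    if words[0].lower() == "where" and words[1] == "is" and words[-1] == "located":
        subject = " ".join(words[2:-1])
        queries.append(f'"{subject} is located in"')
    elif words[0].lower() == "when" and words[1] == "was" and words[-1] == "invented":
        subject = " ".join(words[2:-1])
        queries.append(f'"{subject} was invented in"')
        queries.append(f'"{subject} was invented on"')
    return queries

print(reformulate("Where is Belize located?"))
# ['"Belize is located in"']
```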

Fig. 2. Reformulation learning system.

For instance, if new formulations are being searched for based on the argument tuple [General Electric scientist, silly putty], then these arguments are used as keywords, and two answer patterns are searched for in the retrieved documents, in each of which a verb is required to occur between the two keywords. This verb describes a new possible formulation of the original semantic relation.

Next, in the validation stage, a binary decision principle is applied to the formulations received from the previous step, discriminating suitable from unsuitable expressions of the semantic relationship.
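The acquisition step for the example above can be sketched as follows (our own simplified code; the "-ed word" heuristic is a deliberately crude stand-in for a real POS tagger):

```python
import re

def verbs_between(sentence, arg1, arg2):
    """Return candidate verbs occurring between the two argument keywords.

    Crude heuristic: any word ending in '-ed' between the arguments is
    treated as a verb (a real system would use a POS tagger here).
    """
    s = sentence.lower()
    i, j = s.find(arg1.lower()), s.find(arg2.lower())
    if i < 0 or j < 0:
        return []
    start, end = (i + len(arg1), j) if i < j else (j + len(arg2), i)
    return re.findall(r"\b(?:was\s+)?(\w+ed)\b", s[start:end])

sentence = "Silly putty was invented by a General Electric scientist in 1943."
print(verbs_between(sentence, "silly putty", "General Electric scientist"))
# ['invented']
```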

3) Generating answer patterns

Once we have discovered semantically equivalent sentences in the retrieved documents, we attempt to simplify them into patterns using both syntactic and semantic features. Each sentence is tagged and syntactically chunked to identify noun phrases and parts of speech (POS). To construct the general form of the answer pattern, we substitute the noun phrases with the corresponding arguments of the answer, and prepositions are removed from the retrieved sentence to make the pattern more general.

4) Assigning confidence weights to candidate patterns

Assigning a weight to each candidate pattern is a challenging task, because some answer patterns are more dependable than others; the weights let us rank the answer pattern list by quality and precision. In our experiments we found the following factors useful: the answer sub-phrase score, the level of semantic similarity between the pattern and the question, the frequency of the pattern, its length, and the distance between keywords. To generate a weight for each pattern we use a function that considers all of the above factors; the resulting weights lie between 0 and 1. Let xi be the i-th pattern in the set X extracted for a question and answer pair. We calculate each factor as follows:
count(xi): how many times the pattern xi was extracted for the given question pattern.
distance: the distance between the answer and the nearest term from the question argument in the pattern.
length(xi): the pattern length, measured in words.
sub_phrase_score: a score for the candidate sub-phrase, based on its similarity to the full candidate answer.
sem_sim(Aq, Bxi): the semantic similarity between the candidate answer pattern and the question, estimating the likelihood that the words in the sentence actually refer to the same fact. This weight is based on the semantic relation between the terms as specified in WordNet:
the original verb of the question scores 1 point;
a synonym of the question verb scores 1/2 point;
a hyponym or hypernym of the question verb scores 1/8 point.
These factors are combined to calculate the final weight of the pattern:

weight(xi) = (count(xi) / count(p)) × (1 / length(xi)) × (1 / distance) × sub_phrase_score × sem_sim(Aq, Bxi)
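The weighting function transcribes directly into code (our own sketch; the factor values in the example are made-up inputs, and sub_phrase_score and sem_sim are assumed to be precomputed scores in [0, 1]):

```python
def pattern_weight(count_xi, count_p, length_xi, distance,
                   sub_phrase_score, sem_sim):
    """Combine the five factors into a single confidence weight."""
    return ((count_xi / count_p)
            * (1.0 / length_xi)
            * (1.0 / distance)
            * sub_phrase_score
            * sem_sim)

# A pattern extracted 3 times out of 4, 2 words long, answer adjacent to the
# question term (distance 1), with perfect sub-phrase and similarity scores:
print(pattern_weight(count_xi=3, count_p=4, length_xi=2, distance=1,
                     sub_phrase_score=1.0, sem_sim=1.0))  # 0.375
```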

IV. EVALUATION OF THE SYSTEM

Our system was implemented in the Perl scripting language, with minor changes made to the code for efficiency. The purpose of the evaluation is to assess the quality of the results.

TABLE II


RESULT OF EACH QUESTION CLASS WITH THE MANUALLY CREATED PATTERN

Question type | No. of questions | With at least one candidate answer | Top-5 correct answers | Precision
Who   | 52  | 40  | 26  | 0.65
What  | 266 | 83  | 47  | 0.566
Where | 39  | 22  | 13  | 0.599
When  | 71  | 29  | 21  | 0.724
How   | 53  | 17  | 9   | 0.529
Which | 12  | 5   | 2   | 0.4
Why   | 0   | 2   | 0   | 0
Total | 493 | 198 | 118 | 0.595

TABLE III

RESULT OF EACH QUESTION CLASS WITH THE GENERATED PATTERN

Question type | No. of questions | With at least one candidate answer | Top-5 correct answers | Precision
Who   | 52  | 47  | 29  | 0.425
What  | 266 | 88  | 52  | 0.590
Where | 39  | 25  | 17  | 0.68
When  | 71  | 32  | 25  | 0.781
How   | 53  | 21  | 13  | 0.619
Which | 12  | 7   | 3   | 0.428
Why   | 0   | 0   | 0   | 0
Total | 493 | 220 | 157 | 0.713

We used 493 question-answer pairs from the TREC collection data [14]; the TREC question set was used only for training and evaluation. We submitted these questions to our QA system, which was evaluated both with manually created reformulation patterns and with learned ones, and the obtained candidate answers were compared. The results are reported in Tables II and III, which we compare on precision and on the number of questions with at least one candidate answer. Table IV shows the mean reciprocal rank for each class of question. While the tables show only a slight improvement in precision, our system's results are limited by the syntax of the patterns.
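The precision column in the tables is simply the number of top-5 correct answers divided by the number of questions with at least one candidate answer; a quick check (our own code, not the authors' evaluation script):

```python
def precision(correct_in_top5, with_candidate):
    """Precision as reported in Tables II-III: top-5 correct answers
    divided by questions that produced at least one candidate answer."""
    return correct_in_top5 / with_candidate

# "Who" class, manually created patterns (Table II): 26 correct of 40.
print(round(precision(26, 40), 3))   # 0.65
# "When" class, generated patterns (Table III): 25 correct of 32.
print(round(precision(25, 32), 3))   # 0.781
```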

V. CONCLUSION AND FUTURE WORK

We presented a technique for acquiring reformulation patterns based on the semantic features of sentences obtained from a search engine. The experimental work shows that using semantic-based reformulation patterns helps to improve the performance of a QA system. The present system only considers semantic relations that hold between two or three arguments; the work could easily be extended by considering variable-size relations among the arguments. Future work will focus on improving the quality of the patterns through a systematic evaluation and adjustment of the parameters that take part in weighting the patterns.

VI. REFERENCES

[1] S. Harabagiu, S. Maiorano, M. Pasca, Open-domain textual question answering techniques, Natural Language Engineering 1 (2003) 1-38.

[2] E.M. Voorhees, D.K. Harman (Eds.), Proceedings of the Eighth Text REtrieval Conference (TREC-8), Gaithersburg, Maryland, NIST, 1999.

[3] Ravichandran, D., Hovy, E.: Learning surface text patterns for a question answering system. In: Proceedings of the ACL Conference, Philadelphia (2002) 41-47.

[4] Kwok, C.C.T., Etzioni, O., Weld, D.S.: Scaling question answering to the web. In: World Wide Web (2001) 150-161.

[5] M. Soubbotin, S. Soubbotin, Patterns of potential answer expressions as clues to the right answers, in: Proceedings of the 10th Text REtrieval Conference (TREC 2001), Gaithersburg, Maryland, 2001, pp. 175–182.

[6] Agichtein, E., Lawrence, S., Gravano, L.: Learning search engine specific query transformations for question answering. In: Proceedings of WWW10, Hong Kong (2001) 169–178.

[7] E. Brill, J. Lin, M. Banko, S. Dumais, A. Ng, Data-intensive question answering, in: Proceedings of the 10th Text REtrieval Conference (TREC 2001), Gaithersburg, Maryland, 2001, pp. 393–400.

[8] Kwok, C.C.T., Etzioni, O., Weld, D.S.: Scaling question answering to the web. In: World Wide Web (2001) 150-161.

[9] Radev, D.R., Qi, H., Zheng, Z., Blair-Goldensohn, S., Zhang, Z., Fan, W., Prager, J.M.: Mining the web for answers to natural language questions. In: CIKM (2001) 143-150.

[10] D. Mollá, Learning of graph-based question answering rules, in: Proceedings of the HLT/NAACL 2006 Workshop on Graph Algorithms for Natural Language Processing, 2006, pp. 37-44.

[11] M. Stevenson, M. Greenwood, A semantic approach to IE pattern induction, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, Michigan, 2005, pp. 379–386.

[12] Barzilay, R., K.R. McKeown (2001). Extracting paraphrases from a parallel corpus. In Proceedings of the Association for Computational Linguistics.

[13] Riloff, E., R. Jones (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Conference on Artificial Intelligence.

[14] E.M. Voorhees, D.K. Harman (Eds.), Proceedings of the 11th Text REtrieval Conference (TREC 2002), Gaithersburg, Maryland, NIST, 2002.

[15] Barzilay, R., K.R. McKeown (2001). Extracting paraphrases from a parallel corpus. In Proceedings of the Association for Computational Linguistics.

[16] Matteo Golfarelli, Federica Mandreoli, “A Query Reformulation Framework for P2P OLAP,” Nicola Ferro and Letizia Tanca (Eds.): SEBD 2012.

[17] Kwok, C.C.T., Etzioni, O., Weld, D.S.: Scaling question answering to the web. In: World Wide Web (2001) 150-161.

[18] NIST: Proceedings of TREC-8, Gaithersburg, Maryland, NIST (1999) available at trec.nist.gov/pubs/trec8.

[19] NIST: Proceedings of TREC-9, Gaithersburg, Maryland, NIST (2000) available at trec.nist.gov/pubs/trec9.

[20] NIST: Proceedings of TREC-10, Gaithersburg, Maryland, NIST (2001) available at trec.nist.gov/pubs/trec10.