Presentation of Domain Specific Question Answering System Using N-gram Approach
Presented by:
Tasnim Ara Islam
Roll: 1007010
Farh Naz Chowdhuy
Roll: 1007038
Supervisor:
Dr. K.M. Azharul Hasan
Professor
Dept of CSE, KUET.
Domain Specific Question Answering
System Using N-gram Approach
Project/Thesis CSE 4000
Outline
Introduction
Objective
Problem Statement
Scope of thesis
Theoretical Consideration
POS Tagger
N-Gram
Q/A System Using N-gram Approach
Experimental Analysis
Introduction
Objective
Users want specific answers rather than full-text
documents or best-matching passages.
To find answers to factoid questions (about people, places,
or quantities) using domain-specific
documents.
Problem Statement
Our system is a Q/A system, which is a specific
type of information retrieval.
Given a text document, the system attempts to
find the sentence that best answers the question.
The output is a complete sentence, not a snippet
or a short answer.
Scope of the Thesis
WH- words: Who, What, When, Where, Which, Whom.
Domain specific document.
N-Gram mining approach.
Environment:
Eclipse Java EE IDE (Version: Luna Service Release 2 (4.4.2)),
JRE 1.8.0_45
Stanford POS Tagger (Version 3.0.1).
Lists: regular and irregular verb list, synonym list.
Theoretical Consideration
Q/A systems
Pattern based question answering system.
Ex. <NAME> was born on <ANSWER>
Key reference: AskMSR, a web-based Q/A system.
It uses N-gram mining, filtering and tiling to obtain the
answer.
N-grams are applied to both the question and the text sentences.
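The pattern-based idea above can be illustrated with a minimal Python sketch. It is not taken from AskMSR or from the presented system; the helper names `pattern_to_regex` and `extract_answer` and the sample sentence are hypothetical, and the `<ANSWER>` slot is assumed to run up to the next sentence-ending punctuation mark.

```python
import re

def pattern_to_regex(pattern, name):
    """Turn a surface pattern such as '<NAME> was born on <ANSWER>' into a regex.

    Assumption: <ANSWER> captures everything up to the next '.', '?' or '!'.
    """
    escaped = re.escape(pattern)
    # Replace the escaped placeholders so the result is the same on every
    # Python version, whatever characters re.escape chooses to escape.
    escaped = escaped.replace(re.escape("<NAME>"), re.escape(name))
    escaped = escaped.replace(re.escape("<ANSWER>"), r"(?P<answer>[^.?!]+)")
    return re.compile(escaped, re.IGNORECASE)

def extract_answer(pattern, name, text):
    """Return the text captured by the <ANSWER> slot, or None if no match."""
    match = pattern_to_regex(pattern, name).search(text)
    return match.group("answer").strip() if match else None

# Hypothetical sample sentence, used only to exercise the pattern.
sample = "Alice was born on 1 January 1990. She lives in Khulna."
print(extract_answer("<NAME> was born on <ANSWER>", "Alice", sample))
```

A fixed pattern like this only fires when the text uses exactly that wording, which is one motivation for the looser N-gram matching used in the rest of the presentation.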
N-Gram
N-grams are sequences of characters or words extracted
from a text.
Types -
1. Character based
2. Word based
An n-gram of size 1 is referred to as a Unigram; size 2 is a
Bigram; size 3 is a Trigram and so on.
Example sentence: Taj mahal is a world heritage site.
Bigrams are-
Taj mahal, mahal is, is a, a world, world heritage, heritage site
Trigrams are-
Taj mahal is, mahal is a, is a world, a world heritage, world heritage site
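The bigram and trigram lists above can be reproduced with a short Python sketch. The function names `word_ngrams` and `char_ngrams` are illustrative and not part of the presented system; trailing sentence punctuation is assumed to be dropped before splitting on whitespace.

```python
def word_ngrams(sentence, n):
    """Return word-based n-grams of a sentence as space-joined strings."""
    # Assumption: strip trailing punctuation, then split on whitespace.
    words = sentence.rstrip(".?!").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Return character-based n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "Taj mahal is a world heritage site."
print(word_ngrams(sentence, 2))  # bigrams
print(word_ngrams(sentence, 3))  # trigrams
print(char_ngrams("mahal", 3))   # character trigrams
```

A sentence of k words yields k - n + 1 word n-grams, so the seven-word example gives six bigrams and five trigrams.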
POS Tagger
A POS tagger is software that reads text in some language
and assigns a part of speech (noun, verb, adjective, etc.)
to each word.
The Stanford POS Tagger is an NLP library that detects
parts of speech in English text.
Input: I like watching movies.
Output:
I_PRP like_VBP watching_VBG movies_NNS
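Tagged output in the `word_TAG` form shown above can be split back into words and tags with a few lines of Python. This sketch does not call the Stanford POS Tagger; it only parses a string in that format. The function names are hypothetical. It relies on the Penn Treebank convention that verb tags start with `VB` and noun tags with `NN`.

```python
def parse_tagged(tagged):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        # rpartition splits on the LAST underscore, so words that contain
        # an underscore themselves are kept intact.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

def words_with_tag_prefix(pairs, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]

pairs = parse_tagged("I_PRP like_VBP watching_VBG movies_NNS")
print(words_with_tag_prefix(pairs, "VB"))  # verbs
print(words_with_tag_prefix(pairs, "NN"))  # nouns
```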
Q/A System Using N-gram
Steps of implementation
1. Enter a domain-specific question in the GUI.
2. Split the text file into sentences.
3. Query reformulation.
I. Change the corresponding verb.
a. Do, Does, Did.
b. Regular or irregular.
c. Synonym word.
II. Find the parts of speech of the words in the question using the POS Tagger.
III. Select the verb, main verb and noun.
4. The verb, main verb and noun are matched with the passage sentences
by N-Gram mining.
5. The sentence with the maximum match on verb and main verb is
the answer.
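The verb-change step (3.I) can be sketched in Python under stated assumptions. The slides do not spell out the rule, so this sketch assumes that the auxiliary decides the verb form looked for in the passage: "did" maps the main verb to its past form, "does" to the third-person singular, and "do" leaves it unchanged. The table `IRREGULAR_PAST` is a hypothetical three-entry stand-in for the irregular verb list mentioned in the scope, and synonym handling is left out.

```python
# Hypothetical, tiny stand-in for the irregular verb list.
IRREGULAR_PAST = {"begin": "began", "build": "built", "become": "became"}

def regular_past(verb):
    """Past form of a regular verb (consonant doubling is not handled)."""
    if verb.endswith("e"):
        return verb + "d"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ied"
    return verb + "ed"

def third_person(verb):
    """Third-person singular present form of a verb."""
    if verb.endswith(("s", "sh", "ch", "x", "z", "o")):
        return verb + "es"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ies"
    return verb + "s"

def reformulate_verb(auxiliary, verb):
    """Assumed rule: the auxiliary (do/does/did) selects the verb form."""
    aux, verb = auxiliary.lower(), verb.lower()
    if aux == "did":
        return IRREGULAR_PAST.get(verb, regular_past(verb))
    if aux == "does":
        return third_person(verb)
    return verb

# "When did the construction begin?" -> look for "began" in the passage.
print(reformulate_verb("did", "begin"))
```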
System in Brief…
web-query-solution(filename, passageName, question)
begin
    sSentence{} := sentences from file
    qVerb{} := verbs from question
    qMainVerb{} := main verb from question
    qNoun{} := nouns from question
    for each sentence s in sSentence{} do
    begin
        if NGram(s) matches NGram(qVerb OR qMainVerb OR qNoun) then
            count the matched verbs, main verbs and nouns for s
    end
    max := sentence with the MAXIMUM no. of verb and main verb matches
    return max as the answer String
end
Fig: System Algorithm.
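The system algorithm can be approximated by the following Python sketch. It is a simplified illustration, not the original Java implementation. It matches unigrams only, splits sentences on `.`, `?` and `!`, and takes the question's verbs and nouns as ready-made lists instead of calling a POS tagger. Using the noun count to break ties is an assumption added here; the slides rank by verb and main verb matches only.

```python
import re

def split_sentences(passage):
    """Split a passage into lowercase sentences on '.', '?' and '!'."""
    parts = re.split(r"[.?!]+", passage.lower())
    return [part.strip() for part in parts if part.strip()]

def unigrams(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"[a-z0-9'-]+", text.lower())

def best_sentence(passage, verbs, main_verbs, nouns):
    """Return the sentence with the most verb/main-verb matches.

    Assumption: noun matches are used only to break ties.
    Returns None when no sentence matches any query word.
    """
    best, best_key = None, (0, 0)
    for sentence in split_sentences(passage):
        words = set(unigrams(sentence))
        verb_hits = sum(1 for v in verbs + main_verbs if v in words)
        noun_hits = sum(1 for n in nouns if n in words)
        key = (verb_hits, noun_hits)
        if key > best_key:
            best, best_key = sentence, key
    return best

passage = ("The construction began around 1632. "
           "The construction was completed around 1653.")
# Question: "When did the construction begin?" after reformulation.
print(best_sentence(passage, [], ["began"], ["construction"]))
```

Because the whole matching sentence is returned, the output is a full sentence, not a short answer span, which matches the problem statement.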
User Input
Experimental Analysis
Case Study 1
The Taj Mahal is a white marble mausoleum. It is located in Agra, Uttar
Pradesh, India. Mughal emperor Shah Jahan built Taj mahal in memory of
his third wife, Mumtaz Mahal. The Taj Mahal is widely recognized as "the
jewel of Muslim art in India". In 1983, the Taj Mahal became a UNESCO
World Heritage Site. The construction began around 1632. The construction
was completed around 1653. The architects of Taj mahal are Abd ul-Karim
Ma'mur Khan, Makramat Khan, and Ustad Ahmad Lahauri. Lahauri is
generally considered to be the principal designer. In 1631, Shah Jahan was
grief-stricken for the death of his wife. Mumtaz Mahal was Shah Jahan's
third wife and a Persian princess. Mumtaz died during the birth of their 14th
child, Gauhara Begum.
Case Study 2 and 3
Child Labour
Bangladesh Cricket Team
Output Ranking
Excellent
Satisfactory
Bad
Experimental Analysis
No. / Q/A / Rank
1. Q. Where is Taj mahal located in?
Ans: taj mahal is located in agra uttar pradesh india
Rank: Excellent
2. Q. What is Taj mahal?
Ans: the taj mahal is widely recognized as the jewel of muslim art in india.
Rank: Satisfactory
3. Q. Who was mumtaj mahal?
Ans: in 1631 shah jahan was grief-stricken for the death of his
Rank: Bad
4. Q. When did the construction begin?
Ans: the construction began around 1632.
Rank: Excellent
5. Q. Who is the principal designer?
Ans: the taj mahal is widely recognized as the jewel of muslim art in india.
Rank: Bad
Experimental Results
Case study 1: Taj Mahal, 32 questions.
Excellent: 15 (46.88%)
Satisfactory: 14 (43.75%)
Bad: 3 (9.38%)
Case study 2: Child Labour, 14 questions.
Excellent: 6 (42.86%)
Satisfactory: 1 (7.14%)
Bad: 7 (50%)
Case study 3: Bangladesh Cricket Team, 24 questions.
Excellent: 14 (58.33%)
Satisfactory: 0 (0%)
Bad: 10 (41.67%)
Accuracy Measure
Total questions asked = 32 + 14 + 24 = 70
Among those,
Excellent answers = 15 + 6 + 14 = 35
Satisfactory answers = 14 + 1 + 0 = 15
Bad answers = 3 + 7 + 10 = 20
Percentage of Excellent answers = 35 / 70 = 50%
Percentage of Satisfactory answers = 15 / 70 = 21.43%
Percentage of Bad answers = 20 / 70 = 28.57%
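The accuracy figures can be re-derived from the per-case-study counts with a short Python sketch. The counts are copied from the experimental results; the names `CASE_STUDIES` and `overall_percentages` are illustrative only.

```python
# Counts per case study, as reported in the experimental results.
CASE_STUDIES = {
    "Taj Mahal": {"Excellent": 15, "Satisfactory": 14, "Bad": 3},
    "Child Labour": {"Excellent": 6, "Satisfactory": 1, "Bad": 7},
    "Bangladesh Cricket Team": {"Excellent": 14, "Satisfactory": 0, "Bad": 10},
}

def overall_percentages(case_studies):
    """Sum the counts per rank and convert them to percentages of the total."""
    totals = {}
    for counts in case_studies.values():
        for rank, count in counts.items():
            totals[rank] = totals.get(rank, 0) + count
    asked = sum(totals.values())
    shares = {rank: round(100 * count / asked, 2) for rank, count in totals.items()}
    return asked, totals, shares

asked, totals, shares = overall_percentages(CASE_STUDIES)
print(asked, totals, shares)
```

Counting Excellent and Satisfactory answers together, 50 of the 70 questions (71.43%) received an acceptable answer.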
Limitations
Handles simple sentences only.
Does not handle antonyms or spelling errors.
Not domain independent.
Cannot handle complex questions.
Conclusion
We faced several difficulties while implementing the
system. Much remains to be done to make it domain
independent, and implementing more linguistic
features would make the system more robust.
Thank You.