Post on 05-Jan-2016
Natural Language Processing
for Information Retrieval
-KVMV Kiran (04005031)
-Neeraj Bisht (04005035)
-L.Srikanth (04005029)
OUTLINE
What is Information Retrieval (IR)?
Approaches to IR
Evaluation of IR methods
Statistical IR methods
Linguistic IR methods
Conclusion
Q&A
What is Information Retrieval?
Retrieving information media whose content is relevant to a user's information need.
Information media can be text, documents, images, or videos.
Used for:
  Searching
  Organization
Approaches to IR
Two types of retrieval:
  By metadata (subject, heading, keywords, etc.)
  By content
Metadata can be:
  Manually assigned
  Automatically assigned
Content-based IR is the more successful of the two.
Evaluation of IR methods
Precision: the proportion of the retrieved set that is relevant
  Precision = |relevant & retrieved| / |retrieved| = P(relevant | retrieved)
Recall: the probability that a relevant document is retrieved by the query
  Recall = |relevant & retrieved| / |relevant| = P(retrieved | relevant)
Example
A collection has 1000 documents: 400 relevant and 600 non-relevant to a query.
An IR procedure retrieves 75 relevant and 25 non-relevant documents.
Precision = 75/100 = 0.75
Recall = 75/400 = 0.1875
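The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch; the function name `precision_recall` and its parameter names are chosen here for illustration and do not come from the slides.

```python
def precision_recall(relevant_retrieved, retrieved, relevant):
    """Return (precision, recall) computed from raw document counts."""
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant if relevant else 0.0
    return precision, recall

# Counts from the example: 75 relevant + 25 non-relevant documents retrieved,
# 400 relevant documents in the whole collection.
precision, recall = precision_recall(relevant_retrieved=75, retrieved=100, relevant=400)
print(precision, recall)  # 0.75 0.1875
```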
Evaluating IR methods
It is trivial to achieve a recall of one: retrieve every document in the collection.
Precision tends to decrease as recall increases.
A good IR procedure should keep both high.
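The trade-off can be made concrete by walking down a ranked result list and computing both measures at each cutoff. The ranking and the relevance judgements below are made-up data for illustration, and `precision_recall_at_cutoffs` is a hypothetical helper name.

```python
def precision_recall_at_cutoffs(ranking, relevant):
    """ranking: doc ids in retrieval order; relevant: set of relevant doc ids.

    Returns a list of (cutoff, precision, recall) tuples, one per rank.
    """
    points = []
    hits = 0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((k, hits / k, hits / len(relevant)))
    return points

ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]  # hypothetical ranked results
relevant = {"d1", "d3", "d6"}                   # hypothetical relevance judgements
points = precision_recall_at_cutoffs(ranking, relevant)
for k, p, r in points:
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```

With this data, retrieving all six documents reaches a recall of 1.0, but precision falls to 0.5.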
Content based IR
Two approaches:
  Statistical
  Linguistic
Statistical IR
A simple approach based on the "bag of words" model.
All words in a document are treated as its index terms.
Each term is assigned a weight as a function of its importance, usually determined by its frequency of appearance.
Retrieval pairs the document's words with those of the query.
Statistical IR(cont..)
Stages in Statistical IR:
  Document preprocessing
    Prepares the documents for parameterisation by eliminating any elements considered superfluous.
  Parameterisation
    Carried out once the relevant terms have been identified; it consists of quantifying the document's characteristics (that is, its terms).
Statistical IR(cont..)
An example: an XML document.
Statistical IR(cont..)
Preprocessing phases:
  Remove elements that are not meant for indexing, such as tags and headers.
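As a rough sketch of this phase, the snippet below strips markup and skips a header element using Python's standard `xml.etree.ElementTree` module. The sample document and its element names (`doc`, `header`, `title`, `body`) are assumptions made up for illustration; the XML example from the original slide is not available.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are assumptions for illustration.
RAW = """<doc>
  <header><id>42</id><date>2005-03-01</date></header>
  <title>Natural Language Processing</title>
  <body>NLP helps <b>Information Retrieval</b> systems.</body>
</doc>"""

def indexable_text(xml_string, skip_tags=("header",)):
    """Return the text content of the document, ignoring skipped elements."""
    root = ET.fromstring(xml_string)
    parts = []

    def walk(element):
        if element.tag in skip_tags:
            return  # drop the element together with everything inside it
        if element.text:
            parts.append(element.text)
        for child in element:
            walk(child)
            if child.tail:  # text that follows the child inside its parent
                parts.append(child.tail)

    walk(root)
    # Collapse the whitespace left behind by the removed markup.
    return " ".join(" ".join(parts).split())

print(indexable_text(RAW))
```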
Statistical IR(cont..)
Text standardising:
  Convert to lowercase
  Remove numerals and dates
  Remove words found in stopword lists
    A stopword list is a list of "empty" words (prepositions, determiners, pronouns, etc.) considered to have little semantic value.
  Identify n-grams
    Identify words that usually occur together (compound words, proper nouns, etc.) so that they can be processed as a single conceptual unit.
    This is done by estimating the probability that two words that often occur together make up a single term (compound), e.g. Artificial Intelligence, European Union.
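The standardising steps can be sketched as follows. The stopword list is a tiny illustrative sample, and scoring word pairs by pointwise mutual information (PMI) is one common way to estimate whether two words form a compound; the slides do not say which estimator is used, so treat that choice, the thresholds, and the function names as assumptions.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "it", "by", "was"}

def standardise(text):
    """Lowercase, drop tokens containing digits, and remove stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [t for t in tokens
            if not any(c.isdigit() for c in t) and t not in STOPWORDS]

def frequent_bigrams(tokens, min_count=2, min_pmi=1.0):
    """Return adjacent word pairs that co-occur more often than chance predicts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    found = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        pmi = math.log2((count / (n - 1)) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi >= min_pmi:
            found[(w1, w2)] = round(pmi, 2)
    return found

sample = ("The European Union funded Artificial Intelligence research in 2005. "
          "Artificial Intelligence is studied in the European Union.")
tokens = standardise(sample)
found = frequent_bigrams(tokens)
print(tokens)
print(sorted(found))
```

On this sample, only the pairs "european union" and "artificial intelligence" occur twice and pass the PMI threshold. Note that pairs are counted after stopword removal, so words separated only by a stopword are treated as adjacent.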
Statistical IR(cont..)
Stemming
  Remove suffixes (and prefixes) to find the root of each word.
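A deliberately crude suffix stripper illustrates the idea. This is not the Porter algorithm or any other published stemmer; the suffix list and the minimum stem length are assumptions picked for this example.

```python
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "ed", "es", "s")  # illustrative only

def crude_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least `min_stem` characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["retrieving", "retrieved", "retrieves", "index", "indexes", "indexing"]:
    print(w, "->", crude_stem(w))
```

All three forms of "retrieve" map to the same stem "retriev", and the three forms of "index" map to "index", so a query containing any one form can match documents containing the others.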
Statistical IR(cont..)
Parameterising the document:
  Assign a weight to each of the relevant terms associated with a document (usually by frequency of appearance).
Statistical IR(cont..)
Estimate the importance of a term with TF*IDF (Term Frequency * Inverse Document Frequency)
  Term Frequency
    A term that appears often in one document is likely to be representative of that document's content.
  Inverse Document Frequency
    A term that appears frequently in all documents has no discriminatory value.
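A minimal sketch of the weighting follows. TF*IDF has several variants; this one assumes length-normalised term frequency and a natural-log IDF, log(N / df), which the slides do not specify. The toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = [
    ["retrieval", "query", "retrieval"],
    ["query", "language"],
    ["query", "stemming"],
]
weights = tf_idf(docs)
print(weights[0])
```

Here "query" appears in every document, so its IDF is log(3/3) = 0 and it receives zero weight everywhere, while "retrieval" is frequent in the first document and absent elsewhere, so it receives the highest weight.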
Drawbacks of Statistical IR
Linguistic variance:
  Synonyms: different words convey the same meaning.
  May cause document silence: relevant documents are not retrieved, so recall decreases.
Linguistic ambiguity:
  Homographs: the same word has different meanings.
  Causes document noise: too many documents may be retrieved, relating to each meaning of the word, so precision decreases.
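Both effects show up even in a toy exact-match retriever. The three documents and the helper `matches` below are invented for illustration.

```python
def matches(query, document):
    """Exact bag-of-words match: does the document share any term with the query?"""
    return bool(set(query.lower().split()) & set(document.lower().split()))

docs = {
    "d1": "cheap automobile for sale",       # about cars, but uses a synonym
    "d2": "the river bank was flooded",      # "bank" as the edge of a river
    "d3": "the bank approved the car loan",  # "bank" as a financial institution
}

car_hits = [d for d, text in docs.items() if matches("car", text)]
bank_hits = [d for d, text in docs.items() if matches("bank", text)]
print(car_hits)   # d1 is missed: document silence
print(bank_hits)  # d2 and d3 both match: document noise for either sense
```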
Summary
Statistical IR treats documents as bags of words.
It does not take the linguistics of the language into consideration.
Hence the need for a more linguistics-based approach using complex NLP techniques.
Linguistic IR
The documents are analysed at different linguistic levels by linguistic tools that add each level's own annotations to the text.
The techniques involved are:
Morphological analysis
  Taggers assign each word to a grammatical category.
Linguistic IR (cont..)
Syntax analysis
  Examines how words are related and used together to form larger grammatical units: phrases and sentences.
  Usually restricted to identifying the most meaningful structures: noun phrases.
Linguistic IR (cont..)
Word Sense Disambiguation
  Index by concept rather than by word.
  e.g. "bank" as a financial institution vs. "bank" as the edge of a river.
  Disambiguation helps for queries like "Runs on a bank".
  One of the most often used tools for word sense disambiguation is the lexicographic database WordNet.
    An annotated semantic lexicon, available in different languages, made up of synonym groups called synsets.
Linguistic IR (cont..)
Synsets provide short definitions along with the different semantic relationships between synonym groups.
There are 23 synsets for "stock", including:
  broth, stock
  livestock, stock, farm animal
  stock certificate, stock
  stock, gillyflower
  stock, carry, stockpile (verb)
  standard, stock (adjective)
Linguistic IR (cont..)
Use of synsets
  For each query word, find its synsets.
    Query "punch recipes": punch (3 synsets), recipe (1 synset)
  Expand each synset into its "neighborhood".
    Grow with WordNet hyponym ("is a kind of") relationships until any additional growth would include a different sense of any word in the core synset.
  To disambiguate words in a document:
    Look at all synset neighborhoods for words in the document.
    Compare to the way they overlap throughout the collection.
Linguistic IR (cont..)
Choose the neighborhoods where local activity is greater
than expected global activity
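The selection rule can be sketched with hand-written neighborhoods standing in for real WordNet data. Everything here is an assumption for illustration: the sense labels, the words in each neighborhood, the toy collection, and the choice to measure "activity" as the fraction of context words that fall inside a neighborhood.

```python
# Hypothetical, hand-written neighborhoods; these are NOT taken from WordNet.
NEIGHBORHOODS = {
    "bank": {
        "bank/finance": {"money", "loan", "deposit", "account"},
        "bank/river": {"river", "shore", "water", "slope"},
    }
}

def activity(tokens, hood, ambiguous_word):
    """Fraction of context tokens (excluding the ambiguous word) found in the hood."""
    context = [t for t in tokens if t != ambiguous_word]
    return sum(t in hood for t in context) / len(context) if context else 0.0

def disambiguate(word, doc_tokens, collection_tokens):
    """Keep the senses whose local (document) activity exceeds the global
    (collection-wide) activity of the same neighborhood."""
    chosen = []
    for sense, hood in NEIGHBORHOODS[word].items():
        local = activity(doc_tokens, hood, word)
        expected = activity(collection_tokens, hood, word)
        if local > expected:
            chosen.append(sense)
    return chosen

doc = "bank approved loan deposit".split()
collection = (doc
              + "river bank water flooded".split()
              + "weather report rain water".split())
senses = disambiguate("bank", doc, collection)
print(senses)
```

In this toy collection the finance neighborhood covers 2 of the 3 context words in the document but only 2 of 10 across the collection, so the finance sense is kept and the river sense is rejected.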
Problems with Linguistic techniques in IR
Linguistic techniques must be essentially perfect to help.
Queries are difficult.
Non-linguistic techniques implicitly exploit linguistic knowledge.
Conclusion
Statistical IR methods have some drawbacks.
Linguistic IR methods try to solve those problems but have been fairly unsuccessful.
Effective IR depends upon properties of queries that make some NLP techniques redundant.
Current NLP techniques are not of much help in strict document retrieval.
Q&A
References
Natural Language Processing and Information Retrieval, by Ellen M. Voorhees
Natural Language Processing in Textual Information Retrieval and Related Topics, by Mari Vallez and Rafael Pedraza-Jimenez (http://www.hipertext.net/english/pag1025.htm)
NLP for IR, by James Allan (http://citeseer.ist.psu.edu/308641.html)
"A lecture on information retrieval", by Douglas W. Oard (http://www.glue.umd.edu/~oard/papers/CMSC723.ppt)