natural language processing for information retrieval -kvmv kiran (04005031) -neeraj bisht...

Natural Language Processing

for Information Retrieval

-KVMV Kiran (04005031)

-Neeraj Bisht (04005035)

-L.Srikanth (04005029)

OUTLINE

What is Information Retrieval (IR)?
Approaches to IR
Evaluation of IR methods
Statistical IR methods
Linguistic IR methods
Conclusion
Q&A

What is Information Retrieval?

Retrieving information media whose content is relevant to a user's information need
Information media can be text, documents, images, videos
Used for
  Searching
  Organization

Approaches to IR

Two types of retrieval
  By metadata (subject, heading, keywords, etc.)
  By content
Metadata can be
  Manually assigned
  Automatically assigned
Content-based IR is the more successful of the two.

Evaluation of IR methods

Precision: proportion of the retrieved set that is relevant
  Precision = |relevant & retrieved| / |retrieved|
            = P(relevant | retrieved)

Recall: probability that a relevant document is retrieved by the query
  Recall = |relevant & retrieved| / |relevant|
         = P(retrieved | relevant)

Example

1000 documents: 400 relevant and 600 non-relevant to a query.
An IR procedure retrieves 75 relevant and 25 non-relevant documents.
  Precision = 75/100 = 0.75
  Recall = 75/400 = 0.1875
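The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function name precision_recall and its parameter names are illustrative and do not come from the slides.

```python
def precision_recall(relevant_retrieved, retrieved, relevant):
    """Return (precision, recall) computed from raw document counts."""
    precision = relevant_retrieved / retrieved  # P(relevant | retrieved)
    recall = relevant_retrieved / relevant      # P(retrieved | relevant)
    return precision, recall


# Counts from the example: 75 relevant and 25 non-relevant documents are
# retrieved, and the collection holds 400 relevant documents in total.
precision, recall = precision_recall(
    relevant_retrieved=75, retrieved=75 + 25, relevant=400
)
print(precision, recall)  # 0.75 0.1875
```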

Evaluating IR methods

It is trivial to achieve a recall of one: retrieve every document
Precision tends to decrease as recall increases
A good IR procedure should score high on both

Content based IR

Two approaches
  Statistical
  Linguistic

Statistical IR

Simple approach based on the "bag of words" model
All words in a document are treated as its index terms
Each term is assigned a weight as a function of its importance, usually determined by its frequency of appearance
Retrieval works by matching the document's words with the query's words

Statistical IR (cont.)

Stages in statistical IR:
  Document preprocessing
    Prepares the documents for parameterisation by eliminating any elements considered superfluous
  Parameterisation
    Performed once the relevant terms have been identified; consists in quantifying the document's characteristics (that is, the terms)

An example: an XML document

Preprocessing phases
  Remove elements that are not meant for indexing, such as tags and headers
  Text standardising
    Convert to lowercase
    Remove numerals and dates
    Remove words in stopword lists
      A stopword list is a list of empty words (prepositions, determiners, pronouns, etc.) considered to have little semantic value
  Identify n-grams
    Identify words that usually occur together (compound words, proper nouns, etc.) so that they can be processed as a single conceptual unit
    Done by estimating the probability that two words that often occur together make up a single term (compound), e.g. Artificial Intelligence, European Union
  Stemming
    Remove suffixes (and prefixes) to find the root of each word
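The preprocessing phases listed above can be sketched in Python roughly as follows. The stopword list, the suffix list and all function names are assumptions made for illustration. The stemmer is a crude suffix stripper, not an established algorithm such as Porter's, and n-gram identification is left out to keep the sketch short.

```python
import re

# Hypothetical, very small stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "are", "to", "for"}
# Hypothetical suffix list for a crude stemmer (not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def strip_tags(text):
    """Replace markup tags with spaces so only the text content remains."""
    return re.sub(r"<[^>]+>", " ", text)


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Remove tags, lowercase, drop numerals and stopwords, then stem."""
    text = strip_tags(text).lower()
    tokens = re.findall(r"[a-z]+", text)  # keeps letters only, so numerals drop out
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


doc = "<title>Indexing Documents</title><p>The 3 systems retrieved documents in 2005.</p>"
print(preprocess(doc))
# ['index', 'document', 'system', 'retriev', 'document']
```

The stem "retriev" is not a dictionary word. That is normal for stemming, which only needs to map related word forms to the same index term.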

Parameterising the document
  Assign a weight to each of the relevant terms associated with a document (usually by frequency of appearance)
  Estimate the importance of a term with TF*IDF (Term Frequency * Inverse Document Frequency)
    Term Frequency
      A term that appears often in one document is likely to be representative of that document's content
    Inverse Document Frequency
      A term that appears frequently in all documents has no discriminatory value
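A minimal sketch of TF*IDF weighting follows. The slides do not fix a formula, so this assumes one common variant: raw term counts for TF and log(N / df) for IDF, where N is the number of documents and df is the number of documents containing the term. The function name tf_idf and the toy documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy collection (assumed for illustration): "stock" occurs in every document.
docs = [
    ["stock", "market", "stock", "price"],
    ["stock", "soup", "recipe"],
    ["stock", "farm", "animal"],
]
weights = tf_idf(docs)
print(weights[0])
```

Because "stock" appears in all three documents, its IDF is log(3/3) = 0 and its weight vanishes despite a term frequency of 2 in the first document. "market" appears in only one document and receives a weight of log(3).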

Drawbacks of Statistical IR

Linguistic variance
  Synonyms: different words convey the same meaning
  May provoke document silence: relevant documents might not be retrieved, so recall decreases
Linguistic ambiguity
  Homographs: the same word has different meanings
  Will provoke document noise: too many documents might be retrieved, relating to each meaning of the word, so precision decreases

Summary

Statistical IR treats documents as bags of words.
It does not take the linguistics of the language into consideration.
Hence the need for a more linguistics-based approach using complex NLP techniques.

Linguistic IR

The documents are analysed at different linguistic levels by linguistic tools, each of which adds its own level's annotations to the text
The techniques involved are:
  Morphological analysis
    Taggers assign each word to a grammatical category

Linguistic IR (cont.)

Syntax analysis
  Examines how words are related and used together to form larger grammatical units, phrases and sentences
  Restricted to identifying the most meaningful structures: nominal sentences

Word sense disambiguation
  Index by concept rather than by word
  e.g. bank as a financial institution vs. bank as the edge of a river
  Disambiguation helps for queries like “Runs on a bank”
  One of the most often used tools for word sense disambiguation is the lexicographic database WordNet
    An annotated semantic lexicon, available in different languages, made up of synonym groups called synsets
    Synsets provide short definitions along with the different semantic relationships between synonym groups
    23 synsets for stock, including:
      broth, stock
      livestock, stock, farm animal
      stock certificate, stock
      stock, gillyflower
      stock, carry, stockpile (verb)
      standard, stock (adjective)

Use of synsets
  For each query word, find its synsets
    Query “punch recipes”: punch (3 synsets), recipe (1 synset)
  Expand each synset into its “neighborhood”
    Grow with WordNet hyponym (is a kind of) relationships until any additional growth would include a different sense of any word in the core synset
  To disambiguate words in a document
    Look at all synset neighborhoods for words in the document
    Compare to the way they overlap throughout the collection
    Choose the neighborhoods where local activity is greater than expected global activity
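The idea of choosing a sense by looking at which neighborhood is active in a document can be illustrated with a toy sketch. This is a deliberate simplification, not the procedure described above: it only counts word overlap between a document and each sense's neighborhood, and it ignores the comparison against collection-wide activity. The sense inventory below is hand-made and hypothetical; a real system would read synsets and their neighborhoods from WordNet.

```python
# Hypothetical hand-made sense inventory, standing in for WordNet neighborhoods.
NEIGHBORHOODS = {
    "punch": {
        "punch (drink)": {"drink", "beverage", "fruit", "bowl", "recipe"},
        "punch (blow)": {"blow", "hit", "fist", "boxing"},
        "punch (tool)": {"tool", "hole", "metal", "perforate"},
    },
}


def disambiguate(word, context_words):
    """Pick the sense whose neighborhood shares the most words with the context."""
    context = set(context_words)
    senses = NEIGHBORHOODS[word]
    return max(senses, key=lambda sense: len(senses[sense] & context))


doc = ["fruit", "punch", "recipe", "for", "a", "party", "bowl"]
print(disambiguate("punch", doc))  # punch (drink)
```

With no overlap at all, max simply returns the first listed sense, so a fuller version would need an explicit "undecided" outcome.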

Problems with Linguistic techniques in IR

Linguistic techniques must be essentially perfect to help
Queries are difficult
Non-linguistic techniques implicitly exploit linguistic knowledge

Conclusion

Statistical IR methods have some drawbacks
Linguistic IR methods try to solve those problems but have been fairly unsuccessful
Effective IR depends upon properties of queries that make some NLP techniques redundant
Current NLP techniques are not of much help in strict document retrieval.
