NATURAL LANGUAGE PROCESSING AND

INFORMATION RETRIEVAL

Dr. Eleni Galiotou

Assistant Professor,

Department of Informatics,

TEI of Athens

Feb 24, 2004 Eleni Galiotou : Tempus Seminar 2

(Computer-based) Information Retrieval

• Locate (electronically available) documents that satisfy the user's information needs

• Information need: a statement in a query language, matched against document surrogates (title, abstract, keywords, etc.)

• Outcome of IR process: articles, memos, reports, books, annotated image and sound files

The IR strategy

• Purpose:

– Retrieve all relevant documents

– Retrieve as few non-relevant documents as possible

• Techniques in classical IR:

– Empirical and ad hoc

– Quantitative methods

• IR: also a Natural Language Processing problem

• Heterogeneous collections of full-text documents

– Need for content understanding => NLP techniques

Main areas of research in IR

• Content analysis

• Relationships between documents to improve efficiency and effectiveness of IR strategies

• Measurement of effectiveness of retrieval

Example: The vector-space model (1)

• The SMART Text Retrieval System

• Documents and queries represented as vectors in T-dimensional space (T: number of distinct terms in the document collection)

• Automated indexing: assignment of terms to a piece of text

• Weighted terms to reflect their relative importance in the text

The vector-space model (2)

• Result of a query: Ranked list of documents ordered by similarity to the query

• Similarity measure: cosine of the angle formed by the query and the document vector (cosine correlation)

$$\mathrm{sim} = \cos(Q, D) = \frac{\sum_{i=1}^{T} q_i d_i}{\left( \sum_{i=1}^{T} q_i^2 \cdot \sum_{i=1}^{T} d_i^2 \right)^{1/2}}$$
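The cosine correlation can be illustrated with a minimal Python sketch. The term weights and document names below are made up for illustration and are not taken from SMART; the function name `cosine_similarity` is also an assumption.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    if len(q) != len(d):
        raise ValueError("q and d must have the same number of term weights")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q) * sum(di * di for di in d))
    # A zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0

# Toy collection with T = 3 distinct terms (hypothetical weights).
query = [1.0, 0.0, 1.0]
docs = {"d1": [2.0, 0.0, 2.0], "d2": [0.0, 3.0, 0.0], "d3": [1.0, 1.0, 0.0]}

# Ranked list of documents, most similar to the query first.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                reverse=True)
print(ranked)
```

Because the cosine depends only on the angle, a document whose weights are a scaled copy of the query (here `d1`) gets the maximum score regardless of its length.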

Extended vector-space model

• Vector : collection of subvectors used to represent different aspects of documents in collection

• Overall similarity between two extended vectors:

$$\mathrm{sim}(Q, D) = \sum_{i} \alpha_i \, \mathrm{sim}_i(Q_i, D_i)$$

where the sum runs over the subvectors i, and $\alpha_i$ = importance of subvector i in the overall similarity between texts
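A small sketch of the weighted combination of subvector similarities, assuming the cosine is used for every subvector. The subvector names (`terms`, `names`) and the weights are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def cosine(q, d):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q) * sum(b * b for b in d))
    return dot / norm if norm else 0.0

def extended_similarity(query, doc, alphas):
    """Sum over subvectors i of alpha_i * sim_i(Q_i, D_i)."""
    return sum(alpha * cosine(query[name], doc[name])
               for name, alpha in alphas.items())

# Each text is a collection of subvectors (hypothetical aspects).
query = {"terms": [1.0, 0.0], "names": [0.0, 1.0]}
doc = {"terms": [1.0, 0.0], "names": [1.0, 0.0]}
alphas = {"terms": 0.75, "names": 0.25}  # importance of each subvector

score = extended_similarity(query, doc, alphas)
print(score)
```

Here the `terms` subvectors match exactly and the `names` subvectors are orthogonal, so only the `terms` weight contributes to the overall score.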

Indexing

• Index language used to describe documents and requests

• Pre-coordinate index terms : logical combination of any index terms used as a label to identify a class of documents

• Post-coordinate terms: combination of classes of documents labeled with the individual index terms

LMI vs. NLI

• Non-Linguistic Indexing: removing stopwords, applying statistical criteria

• Linguistically Motivated Indexing:

– Applying syntactic and/or semantic techniques for term identification and description formation

– Identifying multi-word units and characterizing their internal structure

• NLP needed for automated indexing?!

Index Term Weighting

• tf(t): within-document frequency of term t

• idf(t): inverse document frequency = log(N/n), where N = total number of documents in the collection and n = number of documents containing term t

• General weighting scheme: w(t) = tf(t) × idf(t)

• Assumptions on term independence are often false

• The situation is worse when single-word terms are intermixed with phrasal terms
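The weighting scheme w(t) = tf(t) × idf(t) can be sketched in a few lines of Python. The toy documents are invented, and the natural logarithm is an assumption, since the slide does not state the base of the log.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf * idf} dictionary per tokenized document."""
    n_docs = len(docs)
    # n: number of documents containing each term (document frequency).
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # within-document frequency
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical, already tokenized documents.
docs = [
    ["joint", "venture", "bank"],
    ["joint", "venture", "oil"],
    ["joint", "statement"],
    ["venture", "capital", "capital"],
]
w = tf_idf(docs)
print(w[0])
```

In this toy collection "joint" occurs in three of the four documents, so its weight is much lower than that of "bank", which occurs in only one.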

NLP Based Indexing

• Example: TREC experiments

– “joint venture” is important in the Wall Street Journal database

– “joint” and “venture” are dropped from the list of terms by the system because their idf is too low

• Identify groups of words that create meaningful phrases

• Simple collocations, statistically validated N-grams, Part-Of-Speech tagged sequences, syntactic structures, semantic concepts
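The “joint venture” effect can be reproduced on a toy scale by indexing adjacent word pairs alongside single words. This is only a sketch of the simplest option listed (simple collocations as adjacent bigrams); the sentences are invented and the helper names are assumptions.

```python
import math

def words_and_bigrams(tokens):
    """Index terms for one document: single words plus adjacent word pairs."""
    return set(tokens) | {" ".join(pair) for pair in zip(tokens, tokens[1:])}

def idf(indexed_docs, term):
    """log(N / n) for a term; 0.0 if the term never occurs."""
    n = sum(1 for terms in indexed_docs if term in terms)
    return math.log(len(indexed_docs) / n) if n else 0.0

# Hypothetical collection.
docs = [
    "the joint venture was announced".split(),
    "a joint statement on the venture fund".split(),
    "joint efforts and a new venture".split(),
    "the bank reported earnings".split(),
]
indexed = [words_and_bigrams(d) for d in docs]

for term in ("joint", "venture", "joint venture"):
    print(term, round(idf(indexed, term), 3))
```

The single words appear in three of the four documents and get a low idf, while the phrase appears in only one and is therefore a far more selective index term.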

Obstacles in the application of NLP techniques in IR

• Lack of robustness and efficiency

• Representations produced: complex structures that are hard to compare effectively to determine relevance

Solution: Use NLP to assist IR system (boolean, statistical, probabilistic) in representing documents for search purposes

Off-line database indexing

Stream-based IR Model (1)

• Combination of Statistical and NLP Techniques

• Term Extraction Steps

1. Elimination of stopwords (no-content or low-content words: determiners, prepositions, pronouns, very frequent words)

2. Morphological stemming (affix-stripping process or morphological analysis)
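Stopword elimination followed by affix stripping can be sketched as follows. The stopword list and the suffix list are tiny hypothetical samples, and the stripping rule is deliberately crude; it is not the stemmer used in the system described here.

```python
# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"the", "of", "a", "an", "in", "and", "to", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order

def strip_affix(word):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    """Lowercase, drop punctuation and stopwords, then stem what is left."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [strip_affix(t) for t in tokens if t and t not in STOPWORDS]

terms = extract_terms("The proliferation of weapons is increasing.")
print(terms)
```

Affix stripping produces stems such as "increas" that are not dictionary words; that is acceptable for indexing as long as queries are stemmed the same way.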

Stream-based IR Model (2)

4. Phrase Extraction : Shallow text processing techniques (POS tagging, Phrase boundary detection, Word co-occurrence metrics) used to identify relatively stable groups of words

5. Phrase Normalization: “Head+Modifier” pairs to normalize across syntactic variants and reduce them to a common “concept”, e.g. “weapon proliferation”, “proliferation of weapons” => weapon+proliferate

6. Proper Name Extraction: People names and titles, Location names, Organization names used for indexing
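The phrase normalization step can be illustrated with a toy function that maps two syntactic variants to one head+modifier pair. It only handles the two patterns "N1 N2" and "N2 of N1", uses a crude plural stripper, and does not map "proliferation" to the verb form "proliferate" shown in the example, which would need real morphological analysis. All names are hypothetical.

```python
def singular(word):
    """Crude plural stripping; a real system would use morphological analysis."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def head_modifier(phrase):
    """Normalize 'modifier head' and 'head of modifier' to 'modifier+head'."""
    words = phrase.lower().split()
    if len(words) == 3 and words[1] == "of":
        head, modifier = words[0], words[2]
    elif len(words) == 2:
        modifier, head = words
    else:
        return None  # pattern not covered by this sketch
    return f"{singular(modifier)}+{singular(head)}"

for variant in ("weapon proliferation", "proliferation of weapons"):
    print(variant, "=>", head_modifier(variant))
```

Both variants reduce to the same index term, so a query phrased one way can match a document phrased the other way.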

Stream-based IR Model (3)

• Final results: merge ranked lists of documents obtained from searching all streams with appropriately preprocessed queries.

• Contributions from each stream are weighted using an effective combination of alternative retrieval and routing methods

Meta-search strategy which maximizes the contribution of each stream (base search engines: SMART v. 11, PRISE v. 2, etc.)
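Merging the ranked lists from several streams can be sketched as a weighted sum of per-stream scores. The linear combination, the stream names, the scores, and the weights are all assumptions made for illustration; the slides do not specify the actual merging formula.

```python
from collections import defaultdict

def merge_streams(stream_results, stream_weights):
    """Combine per-stream (doc_id, score) lists into one ranked list of ids."""
    combined = defaultdict(float)
    for stream, results in stream_results.items():
        for doc_id, score in results:
            combined[doc_id] += stream_weights[stream] * score
    # Highest combined score first; ties broken by document id.
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))

# Hypothetical output of three indexing streams for one query.
results = {
    "stems": [("d1", 0.9), ("d2", 0.5)],
    "phrases": [("d2", 0.8), ("d3", 0.4)],
    "names": [("d3", 0.7)],
}
weights = {"stems": 1.0, "phrases": 2.0, "names": 0.5}

merged = merge_streams(results, weights)
print(merged)
```

A document that is found by several streams, such as `d2` here, accumulates evidence from each of them and can outrank a document that scores highly in a single stream.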

Advantages of Stream Architecture

• Easier to compare contributions of different indexing features or representations

• Convenient testbed to experiment with algorithms designed to merge results obtained using different IR engines and/or techniques

• Easier to fine-tune system in order to obtain optimum performance

• Allows usage of IR engines without having to adapt them

Part Of Speech Tagging (1)

• Allows resolution of lexical ambiguities in running text, assuming a known general type of text and the context in which a word is used

=> more accurate lexical normalization and phrase boundary detection

• Assigns POS label(s) to each word in a text, depending on the labels assigned to preceding words

POS Tagging (2)

• Best-tag-only option: only the top-ranked tag for each word is output

=> gain in speed and robustness of subsequent processes (e.g. parsing)

• Brill's rule-based tagger, trained on Wall Street Journal texts, is used to preprocess the linguistic streams used by SMART
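The idea behind a rule-based tagger of this kind can be shown with a toy example: every word first receives its most likely tag, and contextual rules then correct tags based on the preceding tag. The lexicon and the single rule below are hand-written assumptions, not rules learned by Brill's tagger from the Wall Street Journal.

```python
# Hypothetical lexicon: each word's most likely tag.
LEXICON = {"the": "DT", "can": "MD", "rusted": "VBD",
           "we": "PRP", "open": "VB", "it": "PRP"}

# Contextual rules: (tag to change, new tag, required preceding tag).
RULES = [("MD", "NN", "DT")]  # a modal right after a determiner is a noun

def tag(tokens):
    """Assign initial tags, then apply each contextual rule left to right."""
    tags = [LEXICON.get(t.lower(), "NN") for t in tokens]  # unknown words: NN
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(tokens, tags))

print(tag(["The", "can", "rusted"]))
print(tag(["We", "can", "open", "it"]))
```

Only one tag per word is output here, which corresponds to the best-tag-only option: later stages such as phrase extraction never have to deal with alternative tags.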

Syntactic Tagging (1)

• Capturing semantic dependencies is critical for accurate text indexing

• Need to exploit syntactic structures produced by a fairly comprehensive parser

• TREC experiment: TTP (Tagged Text Parser), based on Linguistic String Grammar

• Full grammar parser with a built-in timer regulating the amount of time allowed for parsing a sentence

Syntactic Tagging (2)

• If no parse is returned before the allotted time elapses, the parser switches to “skip-and-fit” mode

• Result: approximate parse

• Fragments skipped in first pass:

– analyzed by simple phrasal parser looking for noun phrases and relative clauses

– attached to main parse structure

Corpus-based disambiguation of long Noun Phrases (1)

• Relationships between words in complex phrases are required to decompose longer phrases into meaningful head+modifier pairs

• Pair extractor looks at distribution statistics of compound terms

– to decide whether the association between any two words in a noun phrase is syntactically valid and semantically significant

Corpus-based disambiguation (2)

• Phrasal terms extracted in two phases:

1. Only unambiguous head-modifier pairs are generated

2. Distributional statistics gathered in the first phase are used to predict the strength of alternative modifier-modified links within ambiguous phrases

• Example: multiple unambiguous occurrences of “insider trading”, a few of “trading case”, and numerous phrases such as “insider trading case” and “insider trading legislation”
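The two-phase idea can be sketched by ranking the candidate links inside an ambiguous phrase by how often each pair was seen unambiguously. The counts are invented, and using raw counts as the strength measure is an assumption; the slides do not give the actual statistic.

```python
from collections import Counter

# Phase 1 (assumed already done): counts of unambiguous two-word pairs.
pair_counts = Counter({("insider", "trading"): 12, ("trading", "case"): 3})

def rank_links(phrase, counts):
    """Phase 2: order all candidate word pairs in a phrase by their support."""
    words = phrase.split()
    candidates = [(words[i], words[j])
                  for i in range(len(words))
                  for j in range(i + 1, len(words))]
    # Pairs never seen in phase 1 have a count of 0 in a Counter.
    return sorted(candidates, key=lambda pair: counts[pair], reverse=True)

links = rank_links("insider trading case", pair_counts)
print(links)
```

With these toy counts the link between "insider" and "trading" is by far the strongest, so that pair is the most reliable term to extract from the longer phrase.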

Language Resources

• Machine Readable Dictionaries (MRD)

Mixed results in experiments

• Knowledge bases

– CYC : Huge Knowledge base of Common Sense Knowledge, Untested contribution to IR

– WordNet : Models Lexical Knowledge of a native user of English

Usage of WordNet in Information Retrieval Tasks (1)

• WordNet: organized around logical groupings of related terms (synsets)

• Synset: list of synonymous word forms and semantic pointers describing relationships between current and other synsets

• Knowledge Base: Nouns in WordNet

• Nouns: the most content-bearing of all word classes; they occur in every sentence

Usage of WordNet (2)

• WordNet partitioned into Hierarchical Concept Graphs (HCG) based on the IS-A hierarchical links between synsets

• Information content of each synset approximated by estimating the probability of occurrence of all nouns in all subordinate synsets.

• Semantic similarity between two nouns (synsets from which the nouns are drawn): information content of first synset which subsumes the two synsets
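The information-content similarity can be illustrated on a toy IS-A hierarchy. The hierarchy and the noun counts below are invented and do not come from WordNet; the information content of a node is taken as the negative log of the probability of the nouns it subsumes, which is one common way to make the notion concrete.

```python
import math

# Hypothetical IS-A hierarchy (child -> parent) and noun occurrence counts.
PARENT = {"entity": None, "animal": "entity", "vehicle": "entity",
          "dog": "animal", "cat": "animal", "car": "vehicle"}
NOUN_COUNTS = {"entity": 0, "animal": 2, "vehicle": 1,
               "dog": 5, "cat": 4, "car": 8}

def ancestors(node):
    """The node itself followed by all of its IS-A ancestors up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def information_content(node):
    """-log p(node), where p counts the node and every node it subsumes."""
    total = sum(NOUN_COUNTS.values())
    subsumed = sum(c for n, c in NOUN_COUNTS.items() if node in ancestors(n))
    return math.log(total / subsumed)

def similarity(a, b):
    """Information content of the most specific node subsuming both a and b."""
    common = [n for n in ancestors(a) if n in ancestors(b)]
    return max(information_content(n) for n in common)

print(similarity("dog", "cat"), similarity("dog", "car"))
```

Two nouns that only share the root of the hierarchy get a similarity of 0, because the root subsumes everything and therefore carries no information.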

Usage of WordNet (3)

• Simple word sense disambiguation process for documents, which chooses the single most likely sense of each noun occurrence

• Experiments: the top 1000 documents are pre-fetched from the collection using term weighting (a conventional IR technique), and an exhaustive word-distance-based measure is then applied to these documents

Usage of WordNet (4)

• Retrieval effectiveness results using word-word distances (in terms of precision and recall): poor compared to the tf × idf term weighting strategy

• Possibility of errors in syntactic tagging of documents, in word sense disambiguation, in semantic matching between words.
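For reference, precision and recall for a single query can be computed as in the sketch below. The document identifiers and relevance judgments are made up; they are not results from the experiments described here.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Precision measures how few non-relevant documents were retrieved, and recall measures how many of the relevant documents were found, which matches the two goals of the IR strategy.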

Other Roles for NLP (1)

• Routing (Filtering): Amount of training data is the dominant factor in performance

• Text categorization (automatic assignment to prior headings): using complex terms had no extra beneficial effect

• Real contribution to selective content-based information management?

Other Roles for NLP (2)

• Displaying information about whole documents: giving selected phrases is more informative than highlighting matching terms or listing key individual words

Information Extraction and Summarizing

• Real role of NLP: Supporting more exigent information-management functions within a larger, multi-functional whole
