natural language processing and information retrieval dr. eleni galiotou assistant professor,...

NATURAL LANGUAGE PROCESSING AND

INFORMATION RETRIEVAL

Dr. Eleni Galiotou

Assistant Professor,

Department of Informatics,

TEI of Athens

Feb 24, 2004 Eleni Galiotou : Tempus Seminar 2

(Computer-based) Information Retrieval

• Locate (electronically available) documents satisfying user´s information needs

• Information need: A statement in a query language matched against document surrogates (title, abstract, keywords etc)

• Outcome of IR process: articles, memos, reports, books, annotated image and sound files

The IR strategy• Purpose:

– Retrieve all relevant documents– Retrieve as few of non-relevant documents as possible

• Techniques in classical IR: – Empirical and ad-hoc– Quantitative methods

• IR : also a Natural Language Processing problem• Heterogeneous Collections of full-text documents

– Need for Content Understanding => NLP techniques

Main areas of research in IR

• Content analysis

• Relationships between documents to improve efficiency and effectiveness of IR strategies

• Measurement of effectiveness of retrieval

Example: The vector-space model (1)

• The SMART Text Retrieval System

• Documents and queries represented as vectors in T-dimensional space ( T: number of distinct terms in document collection)

• Automated indexing: Assigning of terms to a piece of text

• Weighted terms to reflect their relative importance in the text

The vector-space model (2)

• Result of a query: Ranked list of documents ordered by similarity to the query

• Similarity measure: cosine of the angle formed by the query and the document vector (cosine correlation)

sim = cos(Q,D) =

( qi2 di

2 )1/2

Extended vector-space model

• Vector : collection of subvectors used to represent different aspects of documents in collection

• Overall similarity between two extended vectors:

sim(Q,D) = αi simi (Qi , Di)

subvector i

ai = importance of subvector i in the overall similarity between texts

Indexing

• Index language used to describe documents and requests

• Pre-coordinate index terms : logical combination of any index terms used as a label to identify a class of documents

• Post-coordinate terms: combination of classes of documents labeled with the individual index terms

LMI vs. NLI• Non-Linguistic Indexing: Removing

stopwords, Applying statistical criteria

• Linguistically Motivated Indexing: – Applying syntactic and/or semantic techniques

for term identification and description formation– Identifying multi-word units and characterizing

their internal structure

• NLP needed for automated indexing ?!

Index Term Weighting

• tf (t): within-document frequency of term t• idf(t): inverse document frequency = log (N/n) N= total number of documents in collection n : number of documents containing term t• General weighting schema: w(t) = tf(t) X idf(t) • Assumptions on term independence often false• Situation worse when single-word terms are

intermixed with phrasal terms

NLP Based Indexing• Example: TREC Experiments– “joint venture” important in Wall Street

Journal database– “joint”, “venture” dropped from list of

terms by the system because of too low idf • Identify groups creating meaningful phrases• Simple collocations, Statistically-validated

N-grams, Part-Of-Speech tagged sequences, Syntactic structures, Semantic Concepts

Obstacles in the application of NLP techniques in IR

• Lack of robustness and efficiency

• Representations produced : Complex structures effectively compared to determine relevance

Solution: Use NLP to assist IR system (boolean, statistical, probabilistic) in representing documents for search purposes

Off-line database indexing

Stream-based IR Model (1)

• Combination of Statistical and NLP Techniques

• Term Extraction Steps

1. Elimination of Stopwords (no-content or low content words: determiners, preposition, pronouns, very frequent words)

2. Morphological Stemming: Affix-stripping process or Morphological Analysis)

4. Phrase Extraction : Shallow text processing techniques (POS tagging, Phrase boundary detection, Word co-occurrence metrics) used to identify relatively stable groups of words

5. Phrase Normalization: “Head+Modifier” pairs to normalize across syntactic variants and reduce to a common “concept” , e.g. weapon proliferation, proliferation of weapons weapon+ proliferate

6. Proper Name Extraction: People names and titles, Location names, Organization names used for indexing

• Final results: merge ranked lists of documents obtained from searching all streams with appropriately preprocessed queries.

• Contributions from each stream are weighted using an effective combination of alternative retrieval and routing methods

Meta-search strategy which maximizes contributions of each stream (base search engines: SMART v. 11, PRISE v.2 e.t.c)

Advantages of Stream Architecture

• Easier to compare contributions of different indexing features or representations

• Convenient testbed to experiment with algorithms designed to merge results obtained using different IR engines and/ot techniques

• Easier to fine-tune system in order to obtain optimum performance

• Allows usage of IR engines without having to adopt them

Part Of Speech Tagging (1)

• Allows resolution of lexical ambiguities in a running text assuming a known general type of text and a context in which a word is used

more accurate lexical normalization, phrase boundary detection

• Assigns POS label(s) to each word in a text depending on labels assigned to preceding words

POS Tagging (2)

• Best-tag-only option: Only top-ranked for each word is output

gain in speed and robustness of subsequent processes (e.g. parsing)

• Brill´s rule based Tagger trained on Wall Street Journal texts to preprocess linguistic streams used by SMART

Syntactic Tagging (1)• Capturing semantic dependencies critical for

accurate text indexing• Need to exploit syntactic structures produced by a

fairly comprehensive parser• TREC experiment: TTP (Tagged Text Parser)

based on Linguistic String Grammar• Full grammar parser with a built-in timer

regulating amount of time allowed for parsing a sentence

Syntactic Tagging (2)

• If no parse is returned before allotated time elapses parser in “skip-and-fit” mode

• Result: approximate parse

• Fragments skipped in first pass:

– analyzed by simple phrasal parser looking for noun phrases and relative clauses

– attached to main parse structure

Corpus-based disambiguation of long Noun Phrases (1)

• Relationships between in complex phases required to decompose longer phrases into meaningful head+modifier pairs

• Pair extractor looks at distribution statistics of compound terms

– association between any two words in noun phrase syntactically valid and semantically significant

Corpus-based disambiguation (2)

• Phrasal terms extracted in two phases:1. Only unambiguous head-modifier pairs are

generated2. Distributional statistics gathered in first phase

are used to predict the strength of alternative modifier-modified links within ambiguous phrases

• Example: multiple unambiguous occurrences : “inside trading”, a few of “trading case”, numerous phrases: “insider trading case”, “insider trading legislation”

Language Resources

• Machine Readable Dictionaries (MRD)

Mixed results in experiments

• Knowledge bases

– CYC : Huge Knowledge base of Common Sense Knowledge, Untested contribution to IR

– WordNet : Models Lexical Knowledge of a native user of English

Usage of WordNet in Information Retrieval Tasks (1)• WordNet: organized around logical

groupings of related terms (synsets)

• Synset: list of synonymous word forms and semantic pointers describing relationships between current and other synsets

• Knowledge Base: Nouns in WordNet

• Nouns: Most content-bearing of all word classes and occur in every sentence

Usage of WordNet (2)

• WordNet partitioned into Hierarchical Concept Graphs (HCG) based on the IS-A hierarchical links between synsets

• Information content of each synset approximated by estimating the probability of occurrence of all nouns in all subordinate synsets.

• Semantic similarity between two nouns (synsets from which the nouns are drawn): information content of first synset which subsumes the two synsets

• Simple word sense disambiguation process for documents which choose the single most likely sense of a noun occurrence

• Experiments: Top 1000 documents pre-fetched from the collection using term weighting (conventional IR technique) and exhaustive word distance based measure on these documents

• Retrieval effectiveness results using word-word distances, (in terms of precision and recall): poor compared to the tf X idf term weighting strategy

• Possibility of errors in syntactic tagging of documents, in word sense disambiguation, in semantic matching between words.

Other Roles for NLP (1)

• Routing (Filtering): Amount of training data is the dominant factor in performance

• Text categorization (automatic assignment: to prior headings): Using complex terms had no extra beneficial effect

• ?Real contribution to selective content-based information management

Other Roles for NLP (2)

• Displaying information about whole documents: giving selected phrases more informative than highlighting matching terms or listing key individual words

Information Extraction and Summarizing

• Real role of NLP: Supporting more exigent information-management functions within a larger, multi-functional whole

natural language processing and information retrieval dr. eleni galiotou assistant professor,...

tempus seminar nlp

tempus seminar lmi

tempus seminar example

phrasal terms eleni

text documents

text eleni galiotou

retrieval eleni galiotou

list of documents

Documents

maria kallergi, 1 john j. heine, 2 and ernest wollin 3 1 ...

6. athens, 1890-1940 - university of new...

department of informatics tei of athens 1 advanced...

eleni and free market

welcome to hillsboro news & notes … for both:...

the agro mac-vet environment & eue networks-projects:...

the changing model of teaching and learning multimedia...

tei p5 progress report - university of...

eleni - ntua

ifigenia founta associate professor department of...

tei header

ipp newsletter - issue 12 - european...

tei lite: encoding for interchange: an introduction to the...

graphic arts – media sector in greecegraphic arts –...

haverstraw king s daughters public library · eleni...

esperando a godot - ocw.uca.es · 1 esperando a godot...

buddhism by eleni georgiou

tei flashcards

toolkit eleni katrini

evaluating multimedia enhanced teaching at the department of...