NATURAL LANGUAGE PROCESSING AND
INFORMATION RETRIEVAL
Dr. Eleni Galiotou
Assistant Professor,
Department of Informatics,
TEI of Athens
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 2
(Computer-based) Information Retrieval
• Locate (electronically available) documents satisfying the user's information needs
• Information need: a statement in a query language matched against document surrogates (title, abstract, keywords, etc.)
• Outcome of IR process: articles, memos, reports, books, annotated image and sound files
The IR strategy
• Purpose:
– Retrieve all relevant documents
– Retrieve as few non-relevant documents as possible
• Techniques in classical IR:
– Empirical and ad hoc
– Quantitative methods
• IR: also a Natural Language Processing problem
• Heterogeneous collections of full-text documents
– Need for content understanding => NLP techniques
Main areas of research in IR
• Content analysis
• Relationships between documents to improve efficiency and effectiveness of IR strategies
• Measurement of effectiveness of retrieval
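Retrieval effectiveness is commonly measured with precision (the share of retrieved documents that are relevant) and recall (the share of relevant documents that are retrieved). A minimal sketch follows; the function name and the document identifiers are illustrative assumptions, not part of the seminar material.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Retrieving all relevant documents pushes recall up, while retrieving few non-relevant ones pushes precision up; the two goals usually trade off against each other.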
Example: The vector-space model (1)
• The SMART Text Retrieval System
• Documents and queries represented as vectors in T-dimensional space (T: number of distinct terms in the document collection)
• Automated indexing: assignment of terms to a piece of text
• Weighted terms to reflect their relative importance in the text
The vector-space model (2)
• Result of a query: Ranked list of documents ordered by similarity to the query
• Similarity measure: cosine of the angle formed by the query and the document vector (cosine correlation)
$$\mathrm{sim} = \cos(Q, D) = \frac{\sum_{i=1}^{T} q_i d_i}{\left(\sum_{i=1}^{T} q_i^2 \cdot \sum_{i=1}^{T} d_i^2\right)^{1/2}}$$
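The cosine correlation and the ranked result list can be sketched in a few lines. This is an illustrative sketch only, not the SMART implementation; the function names and the toy vectors are assumptions.

```python
import math


def cosine_similarity(query, document):
    """Cosine of the angle between two term-weight vectors of dimension T."""
    if len(query) != len(document):
        raise ValueError("vectors must have the same dimension T")
    dot = sum(q * d for q, d in zip(query, document))
    norm = math.sqrt(sum(q * q for q in query) * sum(d * d for d in document))
    return dot / norm if norm else 0.0  # an all-zero vector matches nothing


def rank(query, documents):
    """Return (doc_id, score) pairs ordered by decreasing similarity to the query."""
    scores = {doc_id: cosine_similarity(query, vec) for doc_id, vec in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical collection with T = 2 distinct terms.
ranking = rank([1, 1], {"a": [1, 0], "b": [1, 1]})
```

Because the measure depends only on the angle between the vectors, a long document is not favoured merely for containing more term occurrences.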
Extended vector-space model
• Vector : collection of subvectors used to represent different aspects of documents in collection
• Overall similarity between two extended vectors:
$$\mathrm{sim}(Q, D) = \sum_{i} \alpha_i \, \mathrm{sim}_i(Q_i, D_i)$$
where the sum runs over the subvectors i, and α_i = importance of subvector i in the overall similarity between texts
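A minimal sketch of the weighted combination, assuming that each per-subvector similarity is itself a cosine (the slides do not say which measure is used for each subvector). The subvector names "terms" and "names" and the weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extended_similarity(query, document, alphas):
    """Weighted sum of per-subvector similarities.

    query, document: dicts mapping a subvector name to its weight vector
    alphas:          dict mapping a subvector name to its importance
    """
    return sum(alpha * cosine(query[name], document[name])
               for name, alpha in alphas.items())


# Hypothetical subvectors: plain terms and proper names.
q = {"terms": [1, 0], "names": [0, 1]}
d = {"terms": [1, 0], "names": [1, 0]}
score = extended_similarity(q, d, {"terms": 0.7, "names": 0.3})
```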
Indexing
• Index language used to describe documents and requests
• Pre-coordinate index terms : logical combination of any index terms used as a label to identify a class of documents
• Post-coordinate terms: combination of classes of documents labeled with the individual index terms
LMI vs. NLI
• Non-Linguistic Indexing (NLI): removing stopwords, applying statistical criteria
• Linguistically Motivated Indexing (LMI):
– Applying syntactic and/or semantic techniques for term identification and description formation
– Identifying multi-word units and characterizing their internal structure
• NLP needed for automated indexing ?!
Index Term Weighting
• tf(t): within-document frequency of term t
• idf(t): inverse document frequency = log(N/n)
– N: total number of documents in the collection
– n: number of documents containing term t
• General weighting scheme: w(t) = tf(t) × idf(t)
• Assumptions of term independence are often false
• The situation is worse when single-word terms are intermixed with phrasal terms
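The weighting scheme can be sketched as follows. The natural logarithm is an assumption (the slides do not state the base), and the two-document collection is invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute w(t) = tf(t) * log(N / n) for every term of every document.

    documents: list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)                  # N
    doc_freq = Counter()                     # n for each term
    for tokens in documents:
        doc_freq.update(set(tokens))         # count each term once per document
    weights = []
    for tokens in documents:
        tf = Counter(tokens)                 # within-document frequency
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Hypothetical collection: "joint" occurs in every document, so its idf is 0.
w = tf_idf([["joint", "venture", "joint"], ["joint", "report"]])
```

A term that appears in every document receives weight zero regardless of its frequency, which is how a frequent word can be dropped even when it belongs to an important phrase.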
NLP Based Indexing
• Example: TREC experiments
– “joint venture” is important in the Wall Street Journal database
– “joint” and “venture” are dropped from the list of terms by the system because their idf is too low
• Identify groups of words that form meaningful phrases
• Simple collocations, statistically validated N-grams, part-of-speech tagged sequences, syntactic structures, semantic concepts
Obstacles in the application of NLP techniques in IR
• Lack of robustness and efficiency
• Representations produced: complex structures that must be compared effectively to determine relevance
Solution: Use NLP to assist IR system (boolean, statistical, probabilistic) in representing documents for search purposes
Off-line database indexing
Stream-based IR Model (1)
• Combination of Statistical and NLP Techniques
• Term Extraction Steps
1. Elimination of stopwords (no-content or low-content words: determiners, prepositions, pronouns, very frequent words)
2. Morphological stemming (affix-stripping process or morphological analysis)
Stream-based IR Model (2)
4. Phrase Extraction : Shallow text processing techniques (POS tagging, Phrase boundary detection, Word co-occurrence metrics) used to identify relatively stable groups of words
5. Phrase Normalization: “Head+Modifier” pairs used to normalize across syntactic variants and reduce them to a common “concept”, e.g. “weapon proliferation”, “proliferation of weapons” => weapon+proliferate
6. Proper Name Extraction: People names and titles, Location names, Organization names used for indexing
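The first two extraction steps listed above (stopword elimination and affix stripping) can be sketched as below. The stopword list and suffix list are tiny illustrative assumptions; a real system would use a full stopword list and a proper stemmer or morphological analyzer.

```python
# Illustrative lists only; not those used by the system described in the slides.
STOPWORDS = {"the", "of", "a", "an", "in", "to", "and", "it", "is"}
SUFFIXES = ("ing", "es", "ed", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Lowercase, drop punctuation and stopwords, then strip suffixes."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [strip_suffix(t) for t in tokens if t and t not in STOPWORDS]


terms = extract_terms("The proliferation of weapons")
```

Naive affix stripping conflates "weapons" with "weapon" but will also over-strip some words, which is one reason morphological analysis is listed as an alternative.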
Stream-based IR Model (3)
• Final results: merge ranked lists of documents obtained from searching all streams with appropriately preprocessed queries.
• Contributions from each stream are weighted using an effective combination of alternative retrieval and routing methods
Meta-search strategy which maximizes the contribution of each stream (base search engines: SMART v. 11, PRISE v. 2, etc.)
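One simple way to merge the per-stream rankings is a weighted sum of scores. This is an assumption made for illustration: the slides do not give the actual merging formula, and the stream names, scores, and weights below are hypothetical.

```python
from collections import defaultdict


def merge_streams(stream_results, stream_weights):
    """Merge ranked lists from several streams into one ranking.

    stream_results: {stream_name: [(doc_id, score), ...]}
    stream_weights: {stream_name: weight given to that stream}
    """
    combined = defaultdict(float)
    for stream, ranked in stream_results.items():
        weight = stream_weights.get(stream, 0.0)  # unknown streams contribute nothing
        for doc_id, score in ranked:
            combined[doc_id] += weight * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical streams: stemmed words and head+modifier phrases.
results = {
    "stems": [("d1", 0.9), ("d2", 0.4)],
    "phrases": [("d2", 0.8), ("d3", 0.5)],
}
merged = merge_streams(results, {"stems": 1.0, "phrases": 2.0})
```

In practice the scores from different engines are not directly comparable, so some normalization before summing would usually be needed.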
Advantages of Stream Architecture
• Easier to compare contributions of different indexing features or representations
• Convenient testbed to experiment with algorithms designed to merge results obtained using different IR engines and/or techniques
• Easier to fine-tune system in order to obtain optimum performance
• Allows usage of IR engines without having to adapt them
Part Of Speech Tagging (1)
• Allows resolution of lexical ambiguities in running text, assuming a known general type of text and the context in which a word is used
=> more accurate lexical normalization, phrase boundary detection
• Assigns POS label(s) to each word in a text depending on labels assigned to preceding words
POS Tagging (2)
• Best-tag-only option: only the top-ranked tag for each word is output
=> gain in speed and robustness of subsequent processes (e.g. parsing)
• Brill's rule-based tagger, trained on Wall Street Journal texts, used to preprocess linguistic streams used by SMART
Syntactic Tagging (1)
• Capturing semantic dependencies is critical for accurate text indexing
• Need to exploit syntactic structures produced by a fairly comprehensive parser
• TREC experiment: TTP (Tagged Text Parser) based on Linguistic String Grammar
• Full grammar parser with a built-in timer regulating the amount of time allowed for parsing a sentence
Syntactic Tagging (2)
• If no parse is returned before the allotted time elapses, the parser switches to “skip-and-fit” mode
• Result: approximate parse
• Fragments skipped in first pass:
– analyzed by simple phrasal parser looking for noun phrases and relative clauses
– attached to main parse structure
Corpus-based disambiguation of long Noun Phrases (1)
• Relationships between words in complex phrases are required to decompose longer phrases into meaningful head+modifier pairs
• Pair extractor looks at distribution statistics of compound terms
– to decide whether the association between any two words in a noun phrase is syntactically valid and semantically significant
Corpus-based disambiguation (2)
• Phrasal terms extracted in two phases:
1. Only unambiguous head-modifier pairs are generated
2. Distributional statistics gathered in the first phase are used to predict the strength of alternative modifier-modified links within ambiguous phrases
• Example: multiple unambiguous occurrences of “insider trading”, a few of “trading case”, numerous phrases such as “insider trading case”, “insider trading legislation”
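The second phase can be sketched as a lookup into the statistics gathered in the first phase. Using raw pair counts as the link strength is a simplifying assumption, and the counts below are invented to mirror the example, not taken from any corpus.

```python
from collections import Counter


def strongest_pair(phrase, pair_counts):
    """Pick the most strongly attested word pair in a three-word noun phrase.

    phrase:      tuple of three words, e.g. ("insider", "trading", "case")
    pair_counts: Counter of (modifier, head) pairs seen unambiguously in phase 1
    """
    w1, w2, w3 = phrase
    candidates = [(w1, w2), (w2, w3), (w1, w3)]
    # A Counter returns 0 for pairs never seen unambiguously.
    return max(candidates, key=lambda pair: pair_counts[pair])


# Hypothetical phase-1 statistics: many "insider trading", few "trading case".
counts = Counter({("insider", "trading"): 12, ("trading", "case"): 2})
best = strongest_pair(("insider", "trading", "case"), counts)
```

With these counts the phrase is decomposed around "insider trading" rather than "trading case", which matches the intuition behind the example.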
Language Resources
• Machine Readable Dictionaries (MRD)
Mixed results in experiments
• Knowledge bases
– CYC : Huge Knowledge base of Common Sense Knowledge, Untested contribution to IR
– WordNet : Models Lexical Knowledge of a native user of English
Usage of WordNet in Information Retrieval Tasks (1)
• WordNet: organized around logical groupings of related terms (synsets)
• Synset: list of synonymous word forms and semantic pointers describing relationships between current and other synsets
• Knowledge Base: Nouns in WordNet
• Nouns: the most content-bearing of all word classes; they occur in every sentence
Usage of WordNet (2)
• WordNet partitioned into Hierarchical Concept Graphs (HCG) based on the IS-A hierarchical links between synsets
• Information content of each synset approximated by estimating the probability of occurrence of all nouns in all subordinate synsets.
• Semantic similarity between two nouns (synsets from which the nouns are drawn): information content of first synset which subsumes the two synsets
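The information-content measure can be sketched on a toy IS-A hierarchy. Taking information content as the negative log of the estimated probability is an assumption about how the approximation is done; the hierarchy and the noun counts are invented and do not come from WordNet.

```python
import math

# Toy IS-A hierarchy (child -> parent) and noun counts; both invented for illustration.
PARENT = {"dog": "animal", "cat": "animal", "animal": "entity", "car": "entity"}
COUNTS = {"dog": 4, "cat": 3, "animal": 1, "car": 2, "entity": 0}


def ancestors(node):
    """Return the node followed by all of its IS-A ancestors, nearest first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def information_content(node):
    """-log of the probability of meeting a noun in this node or any subordinate node."""
    total = sum(COUNTS.values())
    subsumed = sum(count for n, count in COUNTS.items() if node in ancestors(n))
    return -math.log(subsumed / total)


def similarity(a, b):
    """Information content of the first node that subsumes both a and b."""
    shared = set(ancestors(b))
    common = [n for n in ancestors(a) if n in shared]
    return information_content(common[0])


dog_cat = similarity("dog", "cat")  # subsumed by "animal"
dog_car = similarity("dog", "car")  # subsumed only by the root "entity"
```

Two nouns whose only shared ancestor is the root get similarity zero, because the root covers every noun and so carries no information.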
Usage of WordNet (3)
• Simple word sense disambiguation process for documents, which chooses the single most likely sense of a noun occurrence
• Experiments: Top 1000 documents pre-fetched from the collection using term weighting (conventional IR technique) and exhaustive word distance based measure on these documents
Usage of WordNet (4)
• Retrieval effectiveness results using word-word distances (in terms of precision and recall): poor compared to the tf × idf term weighting strategy
• Possibility of errors in syntactic tagging of documents, in word sense disambiguation, in semantic matching between words.
Other Roles for NLP (1)
• Routing (Filtering): Amount of training data is the dominant factor in performance
• Text categorization (automatic assignment to prior headings): using complex terms had no extra beneficial effect
• Real contribution to selective content-based information management?
Other Roles for NLP (2)
• Displaying information about whole documents: giving selected phrases is more informative than highlighting matching terms or listing key individual words
Information Extraction and Summarizing
• Real role of NLP: Supporting more exigent information-management functions within a larger, multi-functional whole