NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL Dr. Eleni Galiotou Assistant Professor, Department of Informatics, TEI of Athens

NATURAL LANGUAGE PROCESSING AND

INFORMATION RETRIEVAL

Dr. Eleni Galiotou

Assistant Professor,

Department of Informatics,

TEI of Athens

Feb 24, 2004 Eleni Galiotou : Tempus Seminar 2

(Computer-based) Information Retrieval

• Locate (electronically available) documents satisfying the user's information needs

• Information need: a statement in a query language matched against document surrogates (title, abstract, keywords, etc.)

• Outcome of IR process: articles, memos, reports, books, annotated image and sound files

The IR strategy

• Purpose:

– Retrieve all relevant documents

– Retrieve as few non-relevant documents as possible

• Techniques in classical IR:

– Empirical and ad hoc

– Quantitative methods

• IR: also a Natural Language Processing problem

• Heterogeneous collections of full-text documents

– Need for content understanding => NLP techniques

Main areas of research in IR

• Content analysis

• Relationships between documents to improve efficiency and effectiveness of IR strategies

• Measurement of effectiveness of retrieval
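
Retrieval effectiveness is commonly measured with precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). A minimal sketch, assuming the relevant set for a query is known; the function name and document identifiers are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets so the sketch never divides by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical document identifiers, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, which shows the trade-off between retrieving everything relevant and retrieving little that is not.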

Example: The vector-space model (1)

• The SMART Text Retrieval System

• Documents and queries represented as vectors in T-dimensional space (T: number of distinct terms in the document collection)

• Automated indexing: assignment of terms to a piece of text

• Weighted terms to reflect their relative importance in the text

The vector-space model (2)

• Result of a query: Ranked list of documents ordered by similarity to the query

• Similarity measure: cosine of the angle formed by the query and the document vector (cosine correlation)

$$\mathrm{sim} = \cos(Q, D) = \frac{\sum_{i=1}^{T} q_i d_i}{\left( \sum_{i=1}^{T} q_i^2 \cdot \sum_{i=1}^{T} d_i^2 \right)^{1/2}}$$
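
The cosine correlation and the resulting ranked list can be sketched as follows. This is an illustrative sketch, not the SMART implementation; the function names, toy vectors, and weights are assumptions.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two term-weight vectors of dimension T."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same dimension T")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q) * sum(di * di for di in d))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return document names ordered by decreasing similarity to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection with T = 3 distinct terms; the weights are made up.
documents = {
    "d1": [1.0, 0.0, 2.0],
    "d2": [0.0, 3.0, 0.0],
    "d3": [1.0, 1.0, 1.0],
}
query = [1.0, 0.0, 1.0]
ranking = rank(query, documents)
```

A document sharing no terms with the query (d2 here) gets similarity 0 and falls to the bottom of the ranking.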

Extended vector-space model

• Vector : collection of subvectors used to represent different aspects of documents in collection

• Overall similarity between two extended vectors:

$$\mathrm{sim}(Q, D) = \sum_{i} \alpha_i \, \mathrm{sim}_i(Q_i, D_i)$$

where the sum runs over the subvectors i, and $\alpha_i$ = importance of subvector i in the overall similarity between texts
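
A small sketch of the weighted combination above. Using the cosine as every per-subvector similarity is an assumption made here for simplicity; the subvector names ("terms", "phrases") and the weights are hypothetical.

```python
import math


def cosine(q, d):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q) * sum(b * b for b in d))
    return dot / norm if norm else 0.0


def extended_similarity(query, document, alphas):
    """Weighted sum of per-subvector similarities.

    query, document: dicts mapping a subvector name to its vector.
    alphas: dict mapping the same names to importance weights.
    """
    return sum(
        alpha * cosine(query[name], document[name])
        for name, alpha in alphas.items()
    )


# Hypothetical subvectors: one for single terms, one for phrases.
query = {"terms": [1.0, 0.0], "phrases": [0.0, 1.0]}
document = {"terms": [1.0, 0.0], "phrases": [1.0, 0.0]}
alphas = {"terms": 0.7, "phrases": 0.3}
score = extended_similarity(query, document, alphas)
```

The document matches the query on single terms but not on phrases, so only the "terms" weight contributes to the score.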

Indexing

• Index language used to describe documents and requests

• Pre-coordinate index terms : logical combination of any index terms used as a label to identify a class of documents

• Post-coordinate terms: combination of classes of documents labeled with the individual index terms

LMI vs. NLI

• Non-Linguistic Indexing: removing stopwords, applying statistical criteria

• Linguistically Motivated Indexing:

– Applying syntactic and/or semantic techniques for term identification and description formation

– Identifying multi-word units and characterizing their internal structure

• NLP needed for automated indexing?!

Index Term Weighting

• tf(t): within-document frequency of term t

• idf(t): inverse document frequency = log(N/n)

– N: total number of documents in the collection

– n: number of documents containing term t

• General weighting scheme: w(t) = tf(t) × idf(t)

• Assumptions of term independence are often false

• Situation is worse when single-word terms are intermixed with phrasal terms
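
The weighting scheme above can be sketched in a few lines. The natural logarithm is an assumption (the base is not specified), and the toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, with w(t) = tf(t) * log(N / n)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Toy tokenized documents, for illustration only.
docs = [
    ["joint", "venture", "announced"],
    ["joint", "venture", "profits", "profits"],
    ["joint", "statement"],
]
weights = tf_idf(docs)
```

Note that a term occurring in every document ("joint" here) gets idf = log(1) = 0 and therefore weight 0, however important it may be as part of a phrase.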

NLP Based Indexing

• Example: TREC experiments

– “joint venture” important in Wall Street Journal database

– “joint”, “venture” dropped from the list of terms by the system because their idf is too low

• Identify groups creating meaningful phrases

• Simple collocations, statistically validated N-grams, Part-Of-Speech tagged sequences, syntactic structures, semantic concepts
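
One way to validate word pairs statistically is pointwise mutual information (PMI) over adjacent words. The sketch below is an assumption about what such validation could look like, not the measure used in the TREC experiments; the `min_count` threshold and the sample text are made up.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs seen at least min_count times by PMI (base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to say anything reliable
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


text = (
    "the joint venture was announced today and the board said "
    "the joint venture will start after the market opens"
).split()
scores = bigram_pmi(text)
```

Both "the joint" and "joint venture" occur twice, but "joint venture" scores higher because "the" is frequent on its own, so its co-occurrence with "joint" is less surprising.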

Obstacles in the application of NLP techniques in IR

• Lack of robustness and efficiency

• Representations produced: complex structures that are hard to compare effectively to determine relevance

Solution: Use NLP to assist IR system (boolean, statistical, probabilistic) in representing documents for search purposes

Off-line database indexing

Stream-based IR Model (1)

• Combination of Statistical and NLP Techniques

• Term Extraction Steps

1. Elimination of stopwords (no-content or low-content words: determiners, prepositions, pronouns, very frequent words)

2. Morphological stemming (affix-stripping process or morphological analysis)

Stream-based IR Model (2)

4. Phrase Extraction : Shallow text processing techniques (POS tagging, Phrase boundary detection, Word co-occurrence metrics) used to identify relatively stable groups of words

5. Phrase Normalization: “Head+Modifier” pairs to normalize across syntactic variants and reduce to a common “concept”, e.g. weapon proliferation, proliferation of weapons => weapon+proliferate

6. Proper Name Extraction: People names and titles, Location names, Organization names used for indexing
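
Steps 1, 2 and 5 can be illustrated with a deliberately crude sketch. The stopword list, the suffix list, and the two phrase patterns handled are all assumptions chosen to keep the example short; a real system would use a proper stemmer or morphological analyzer and a parser. Note that the crude stemmer yields "prolifer" where the slide writes "proliferate".

```python
STOPWORDS = {"the", "of", "a", "an", "in", "and", "to", "is", "are"}  # tiny illustrative list
SUFFIXES = ("ation", "ing", "es", "s")  # crude affix list, not a real stemmer


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Steps 1 and 2: drop stopwords, then stem what remains."""
    tokens = text.lower().split()
    return [strip_suffix(t) for t in tokens if t not in STOPWORDS]


def head_modifier(phrase):
    """Step 5 for two patterns only: 'X Y' and 'Y of X' map to (modifier X, head Y)."""
    words = phrase.lower().split()
    if len(words) == 3 and words[1] == "of":
        head, modifier = words[0], words[2]
    elif len(words) == 2:
        modifier, head = words
    else:
        raise ValueError("pattern not covered by this sketch")
    return strip_suffix(modifier), strip_suffix(head)


terms = extract_terms("The proliferation of weapons is increasing")
pair_a = head_modifier("weapon proliferation")
pair_b = head_modifier("proliferation of weapons")
```

Both syntactic variants reduce to the same normalized pair, which is the point of phrase normalization.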

Stream-based IR Model (3)

• Final results: merge ranked lists of documents obtained from searching all streams with appropriately preprocessed queries.

• Contributions from each stream are weighted using an effective combination of alternative retrieval and routing methods

Meta-search strategy which maximizes contributions of each stream (base search engines: SMART v. 11, PRISE v. 2, etc.)
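
A minimal sketch of merging per-stream results with stream weights. A weighted sum of scores is only one possible merging strategy and is an assumption here; the stream names, scores, and weights are hypothetical, and scores from different streams are assumed to be comparable.

```python
from collections import defaultdict


def merge_streams(stream_results, stream_weights):
    """Combine per-stream scores into one ranked list of (doc_id, score).

    stream_results: {stream: {doc_id: score}}
    stream_weights: {stream: weight}
    """
    combined = defaultdict(float)
    for stream, results in stream_results.items():
        weight = stream_weights.get(stream, 0.0)
        for doc_id, score in results.items():
            combined[doc_id] += weight * score
    # Sort by score (descending), then by id so ties are deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


stream_results = {
    "stems": {"d1": 0.9, "d2": 0.4},
    "phrases": {"d2": 0.8, "d3": 0.5},
    "names": {"d3": 0.6},
}
stream_weights = {"stems": 1.0, "phrases": 2.0, "names": 1.0}
merged = merge_streams(stream_results, stream_weights)
```

Document d2 is found by two streams, one of them heavily weighted, so it ends up first even though it is not the top result of the "stems" stream.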

Advantages of Stream Architecture

• Easier to compare contributions of different indexing features or representations

• Convenient testbed to experiment with algorithms designed to merge results obtained using different IR engines and/or techniques

• Easier to fine-tune system in order to obtain optimum performance

• Allows usage of IR engines without having to adapt them

Part Of Speech Tagging (1)

• Allows resolution of lexical ambiguities in a running text assuming a known general type of text and a context in which a word is used

=> more accurate lexical normalization, phrase boundary detection

• Assigns POS label(s) to each word in a text depending on labels assigned to preceding words

POS Tagging (2)

• Best-tag-only option: only the top-ranked tag for each word is output

=> gain in speed and robustness of subsequent processes (e.g. parsing)

• Brill's rule-based tagger trained on Wall Street Journal texts to preprocess linguistic streams used by SMART
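
The idea behind rule-based tagging with a single output tag per word can be sketched as a lexicon lookup followed by contextual corrections that depend on the preceding tag. Brill's tagger learns such rules from a tagged corpus; the lexicon and the single rule below are hand-written assumptions for illustration only.

```python
# Toy lexicon: most frequent tag per word (made-up values, Penn Treebank-style tags).
LEXICON = {
    "the": "DT", "to": "TO", "race": "NN",
    "wants": "VBZ", "she": "PRP", "won": "VBD",
}

# Contextual rules: (from_tag, to_tag, required tag of the preceding word).
RULES = [("NN", "VB", "TO")]


def tag(words):
    """Assign one tag per word: lexicon lookup, then contextual corrections."""
    tags = [LEXICON.get(w.lower(), "NN") for w in words]  # unknown words default to NN
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


tagged_verb = tag("she wants to race".split())
tagged_noun = tag("she won the race".split())
```

The word "race" is tagged as a verb after "to" and as a noun after "the", which is the kind of lexical ambiguity the tagger resolves before phrase extraction.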

Syntactic Tagging (1)

• Capturing semantic dependencies critical for accurate text indexing

• Need to exploit syntactic structures produced by a fairly comprehensive parser

• TREC experiment: TTP (Tagged Text Parser) based on Linguistic String Grammar

• Full grammar parser with a built-in timer regulating amount of time allowed for parsing a sentence

Syntactic Tagging (2)

• If no parse is returned before the allotted time elapses, the parser switches to “skip-and-fit” mode

• Result: approximate parse

• Fragments skipped in first pass:

– analyzed by simple phrasal parser looking for noun phrases and relative clauses

– attached to main parse structure

Corpus-based disambiguation of long Noun Phrases (1)

• Relationships between words in complex phrases are required to decompose longer phrases into meaningful head+modifier pairs

• Pair extractor looks at distribution statistics of compound terms

– Decides whether the association between any two words in a noun phrase is both syntactically valid and semantically significant

Corpus-based disambiguation (2)

• Phrasal terms extracted in two phases:

1. Only unambiguous head-modifier pairs are generated

2. Distributional statistics gathered in the first phase are used to predict the strength of alternative modifier-modified links within ambiguous phrases

• Example: multiple unambiguous occurrences of “insider trading”, a few of “trading case”, numerous phrases such as “insider trading case”, “insider trading legislation”
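
The two phases can be sketched for three-word noun phrases. This is a simplification under stated assumptions: only two-word phrases count as unambiguous, the last word is taken as the head, and the decision compares raw counts. The function names and the observed phrases are hypothetical.

```python
from collections import Counter


def collect_pair_counts(two_word_phrases):
    """Phase 1: count (modifier, head) pairs from unambiguous two-word phrases."""
    counts = Counter()
    for phrase in two_word_phrases:
        modifier, head = phrase.split()
        counts[(modifier, head)] += 1
    return counts


def decompose(phrase, counts):
    """Phase 2: choose a bracketing for a three-word noun phrase 'a b c'.

    [[a b] c] yields the pairs (a, b) and (b, c);
    [a [b c]] yields the pairs (a, c) and (b, c).
    """
    a, b, c = phrase.split()
    if counts[(a, b)] >= counts[(a, c)]:
        return [(a, b), (b, c)]
    return [(a, c), (b, c)]


# Made-up observations echoing the example above.
observed = ["insider trading"] * 5 + ["trading case"] * 2 + ["court case"]
counts = collect_pair_counts(observed)
pairs = decompose("insider trading case", counts)
```

Because "insider trading" was seen many times on its own and "insider case" never, the phrase is decomposed into insider+trading and trading+case.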

Language Resources

• Machine Readable Dictionaries (MRD)

Mixed results in experiments

• Knowledge bases

– CYC : Huge Knowledge base of Common Sense Knowledge, Untested contribution to IR

– WordNet : Models Lexical Knowledge of a native user of English

Usage of WordNet in Information Retrieval Tasks (1)

• WordNet: organized around logical groupings of related terms (synsets)

• Synset: list of synonymous word forms and semantic pointers describing relationships between current and other synsets

• Knowledge Base: Nouns in WordNet

• Nouns: the most content-bearing of all word classes; they occur in every sentence

Usage of WordNet (2)

• WordNet partitioned into Hierarchical Concept Graphs (HCG) based on the IS-A hierarchical links between synsets

• Information content of each synset approximated by estimating the probability of occurrence of all nouns in all subordinate synsets.

• Semantic similarity between two nouns (synsets from which the nouns are drawn): information content of first synset which subsumes the two synsets
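
The information-content measure above can be illustrated on a toy IS-A hierarchy. The sketch does not use WordNet itself: the concepts, the parent links, and the noun counts are made up, and taking information content as the negative log of the estimated probability is stated here as an assumption.

```python
import math

# Toy IS-A hierarchy (child -> parent) and made-up noun counts per concept.
PARENT = {"dog": "animal", "cat": "animal", "animal": "entity", "car": "entity"}
COUNTS = {"dog": 30, "cat": 20, "animal": 10, "car": 40, "entity": 0}


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def probability(concept):
    """Estimated probability: counts of the concept and of everything it subsumes."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return subsumed / total


def information_content(concept):
    return -math.log(probability(concept))


def similarity(c1, c2):
    """Information content of the first (most specific) concept subsuming both."""
    candidates = set(ancestors(c2))
    for concept in ancestors(c1):  # walks upward, so the first hit is the most specific
        if concept in candidates:
            return information_content(concept)
    return 0.0


sim_dog_cat = similarity("dog", "cat")
sim_dog_car = similarity("dog", "car")
```

"dog" and "cat" meet at "animal", which is fairly specific, while "dog" and "car" only meet at the root, whose probability is 1 and whose information content is therefore 0.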

Usage of WordNet (3)

• Simple word sense disambiguation process for documents, which chooses the single most likely sense of a noun occurrence

• Experiments: top 1000 documents pre-fetched from the collection using term weighting (conventional IR technique), then an exhaustive word-distance-based measure applied to these documents

Usage of WordNet (4)

• Retrieval effectiveness results using word-word distances (in terms of precision and recall): poor compared to the tf × idf term weighting strategy

• Possibility of errors in syntactic tagging of documents, in word sense disambiguation, in semantic matching between words.

Other Roles for NLP (1)

• Routing (Filtering): Amount of training data is the dominant factor in performance

• Text categorization (automatic assignment to prior headings): using complex terms had no extra beneficial effect

• Real contribution to selective content-based information management?

Other Roles for NLP (2)

• Displaying information about whole documents: giving selected phrases is more informative than highlighting matching terms or listing key individual words

Information Extraction and Summarizing

• Real role of NLP: Supporting more exigent information-management functions within a larger, multi-functional whole