Information Retrieval
Ling573: NLP Systems and Applications
April 26, 2011

Roadmap

Problem: matching topics and documents

Methods:
  Classic: Vector Space Model

Challenge: beyond literal matching
  Relevance feedback
  Expansion strategies

Matching Topics and Documents

Two main perspectives:
  Pre-defined, fixed, finite topics: "Text Classification"
  Arbitrary topics, typically defined by a statement of information need (aka a query): "Information Retrieval"
    Ad-hoc retrieval

Information Retrieval Components

Document collection: used to satisfy user requests; a collection of:
  Documents: the basic unit available for retrieval
    Typically: a newspaper story or encyclopedia entry
    Alternatively: paragraphs or sentences; a web page or site

Query: a specification of an information need

Terms: the minimal units of queries/documents
  Words, or phrases

Information Retrieval Architecture

(Architecture diagram in the original slides.)

Vector Space Model

Basic representation:
  Document and query semantics are defined by their terms
  Typically ignore any syntax: bag-of-words (or bag-of-terms)
    "Dog bites man" == "Man bites dog"

Represent documents and queries as vectors of term-based features
  E.g., vectors of dimension N = # of terms in the collection's vocabulary. Problem?

Representation

Solution 1: binary features
  w = 1 if the term is present, 0 otherwise
  Similarity: number of terms in common, computed as a dot product

Issues?
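
A minimal sketch of this binary bag-of-words similarity; the vocabulary and example texts are illustrative:

```python
# Binary bag-of-words: w = 1 if the term appears, 0 otherwise.
# Similarity is the dot product, i.e. the number of shared terms.

def binary_vector(text, vocabulary):
    """Map a text to a 0/1 vector over a fixed vocabulary."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab = ["dog", "bites", "man", "cat"]
doc = binary_vector("Dog bites man", vocab)
query = binary_vector("Man bites dog", vocab)
print(dot(doc, query))  # 3 shared terms; word order is ignored
```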

VSM Weights

What should the weights be?

"Aboutness": to what degree is this term what the document is about?
  A within-document measure
  Term frequency (tf): # occurrences of term t in document j

Example, with terms (chicken, fried, oil, pepper):
  D1, a fried chicken recipe: (8, 2, 7, 4)
  D2, a poached chicken recipe: (6, 0, 0, 0)
  Q, "fried chicken": (1, 1, 0, 0)
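
Plugging these tf vectors into the dot product shows the fried chicken recipe outscoring the poached one for this query:

```python
# The slide's tf vectors over the terms (chicken, fried, oil, pepper).
D1 = (8, 2, 7, 4)  # fried chicken recipe
D2 = (6, 0, 0, 0)  # poached chicken recipe
Q  = (1, 1, 0, 0)  # query: "fried chicken"

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(dot(Q, D1))  # 10: credit for both 'chicken' and 'fried'
print(dot(Q, D2))  # 6: credit for 'chicken' only
```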

Vector Space Model (II)

Documents & queries:
  Document collection: a term-by-document matrix
  View each document or query as a vector in a multidimensional space
    Nearby vectors are related
  Normalize for vector length
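
A sketch of building a term-by-document count matrix; the documents are illustrative:

```python
# Build a term-by-document count matrix from a toy collection.
from collections import Counter

docs = {
    "d1": "dog bites man",
    "d2": "man bites dog",
    "d3": "fried chicken recipe",
}
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({term for c in counts.values() for term in c})

# Rows are terms, columns are documents: matrix[i][j] = tf of term i in doc j.
matrix = [[counts[name][term] for name in docs] for term in vocab]
for term, row in zip(vocab, matrix):
    print(f"{term:>8}: {row}")
```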

Vector Similarity Computation

Normalization: improve over the plain dot product
  Capture weights
  Compensate for document length

Cosine similarity:

  cos(d_j, q) = Σ_i w_ij·w_iq / (√(Σ_i w_ij²) · √(Σ_i w_iq²))

  Identical vectors: 1
  No overlap: 0
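
A direct implementation of the cosine measure; the example vectors reuse the earlier tf example:

```python
import math

def cosine(u, v):
    """Dot product normalized by both vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine((1, 1, 0, 0), (1, 1, 0, 0)))  # identical vectors: 1.0
print(cosine((1, 1, 0, 0), (0, 0, 2, 5)))  # no overlap: 0.0
print(cosine((1, 1, 0, 0), (8, 2, 7, 4)))  # Q vs. D1 from the tf example: ~0.61
```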

Term Weighting Redux

"Aboutness":
  Term frequency (tf): # occurrences of term t in document j
    Chicken: 6, Fried: 1 vs. Chicken: 1, Fried: 6
  Question: what about 'Representative' vs. 'Giffords'?

"Specificity": how surprised are you to see this term?
  A collection-frequency measure
  Inverse document frequency (idf):

    idf_i = log(N / n_i)

  where N is the total number of documents and n_i is the number of documents containing term i.
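
A small sketch of the idf computation; the toy collection is illustrative and echoes the 'Representative' vs. 'Giffords' contrast:

```python
import math

def idf(term, documents):
    """idf_i = log(N / n_i): N docs in total, n_i of them containing the term."""
    N = len(documents)
    n_i = sum(1 for doc in documents if term in doc)
    return math.log(N / n_i) if n_i else 0.0

# Toy collection of documents represented as term sets.
docs = [{"representative", "said"}, {"representative", "giffords"},
        {"representative", "budget"}, {"senate", "budget"}]
print(idf("representative", docs))  # log(4/3) ~ 0.29: common, low weight
print(idf("giffords", docs))        # log(4/1) ~ 1.39: rare, high weight
```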

Tf-idf Similarity

Variants of tf-idf weighting (w_ij = tf_ij × idf_i) are prevalent in most vector space models.
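
A sketch that puts the pieces together: weight terms by raw tf × idf (one of many tf-idf variants) and rank documents by cosine similarity to the query; all texts are illustrative:

```python
import math
from collections import Counter

docs = ["fried chicken recipe with oil and pepper",
        "poached chicken recipe",
        "pepper oil dressing"]
tokenized = [text.split() for text in docs]
N = len(tokenized)
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Raw-tf x idf weights; terms in every document (idf = 0) are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if 0 < df[t] < N}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vectors = [tfidf(tokens) for tokens in tokenized]
query = tfidf("fried chicken".split())
ranked = sorted(zip(docs, vectors), key=lambda p: cosine(query, p[1]), reverse=True)
for text, vec in ranked:
    print(round(cosine(query, vec), 3), text)
```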

Term Selection

Selection: some terms are truly useless
  Too frequent: appear in most documents
  Little/no semantic content: function words
    E.g., the, a, and, ...

Indexing inefficiency:
  Store in an inverted index: for each term, identify the documents where it appears
  'the': every document is a candidate match

Remove 'stop words' based on a list
  Usually document-frequency based
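
A minimal inverted-index sketch with a stop-word list; both the list and the documents are illustrative:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "and", "of", "in"}  # tiny illustrative stop list

docs = {
    "d1": "the dog bites the man",
    "d2": "the man and the dog",
    "d3": "fried chicken in oil",
}

# Inverted index: for each non-stop term, the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        if term not in STOP_WORDS:  # 'the' would match every document
            index[term].add(doc_id)

print(sorted(index["dog"]))  # ['d1', 'd2']
print(sorted(index["the"]))  # []: stop word, never indexed
```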

Term Creation

Too many surface forms for the same concepts
  E.g., inflections of words: verb conjugations, plurals
    process, processing, processed: same concept, separated only by inflection

Stem terms: treat all forms as the same underlying term
  E.g., 'processing' -> 'process'; 'Beijing' -> 'beij'

Issues: stemming can be too aggressive
  AIDS, aids -> aid; stock, stocks, stockings -> stock
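
These examples can be reproduced with the Porter stemmer; a sketch assuming NLTK is installed:

```python
# Porter stemming via NLTK (assumes `pip install nltk`).
# Note: the stemmer lowercases its input by default.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["process", "processing", "processed",  # conflated, as intended
         "Beijing",                             # over-stemmed to 'beij'
         "AIDS", "aids",                        # both collapse to 'aid'
         "stock", "stocks", "stockings"]        # all collapse to 'stock'
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
```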

Evaluating IR

Basic measures: precision and recall

Relevance judgments:
  For a query, a returned document is relevant or non-relevant
  Typically binary relevance: 0/1

T: returned documents; U: true relevant documents
R: returned relevant documents; N: returned non-relevant documents

  Precision = |R| / |T|        Recall = |R| / |U|
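
A set-based sketch of both measures using the slide's T, U, R notation; the document IDs are illustrative:

```python
def precision_recall(T, U):
    """T = returned documents, U = true relevant documents, R = T & U."""
    R = T & U
    precision = len(R) / len(T) if T else 0.0
    recall = len(R) / len(U) if U else 0.0
    return precision, recall

returned = {"d1", "d2", "d3", "d4"}          # T
relevant = {"d2", "d4", "d7"}                # U
print(precision_recall(returned, relevant))  # (0.5, 0.666...): |R| = 2
```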

Evaluating IR

Issue: ranked retrieval
  Return the top 1K documents, 'best' first
  Suppose 10 relevant documents are returned:
    In the first 10 positions?
    In the last 10 positions?
  Score by precision and recall: which is better?
    Identical!!! Does that correspond to intuition? No!

Need rank-sensitive measures

Rank-specific P & R

Precision@rank: the fraction of documents returned down to that rank that are relevant
Recall@rank: the fraction of all relevant documents returned down to that rank

Note: recall is non-decreasing in rank; precision varies

Issue: too many numbers; no holistic view
  Typically, compute precision at 11 fixed levels of recall (0.0, 0.1, ..., 1.0)
  Interpolated precision: P_interp(r) = max over recall levels r' >= r of P(r')
    Can smooth out variations in precision
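
A sketch of rank-specific precision/recall and 11-point interpolated precision; the relevance ranking is illustrative:

```python
# `ranking` is a relevance list in rank order (1 = relevant, 0 = not).
def precision_at(k, ranking):
    return sum(ranking[:k]) / k

def recall_at(k, ranking, n_relevant):
    return sum(ranking[:k]) / n_relevant

def interpolated_precision(ranking, n_relevant, levels=11):
    """P_interp(r) = max precision at any recall level >= r."""
    points = [(recall_at(k, ranking, n_relevant), precision_at(k, ranking))
              for k in range(1, len(ranking) + 1)]
    interpolated = []
    for i in range(levels):
        r = i / (levels - 1)
        candidates = [p for rec, p in points if rec >= r]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # all 4 relevant docs retrieved
print(interpolated_precision(ranking, n_relevant=4))
```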

Comparing Systems

Create a graph of precision vs. recall
  Averaged over queries
Compare the graphs

Mean Average Precision (MAP)

Traverse the ranked document list, computing precision each time a relevant document is found
  Average these precision values up to some fixed cutoff
    R_r: the set of relevant documents at or above rank r
    Precision(d): precision at the rank where document d is found

    AP = (1 / |R_r|) · Σ_{d ∈ R_r} Precision(d)

Mean Average Precision: the average over all queries of these per-query averages
  (0.6 in the slides' worked example)
  A precision-oriented measure
  A single crisp measure: common in TREC ad-hoc evaluation
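
A sketch of one common formulation, normalizing by the total number of relevant documents; the rankings are illustrative relevance lists:

```python
# Average precision for one query, and MAP over several queries.
def average_precision(ranking, n_relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / rank          # precision at this rank
    return total / n_relevant if n_relevant else 0.0

def mean_average_precision(queries):
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

queries = [([1, 0, 1, 0, 0], 2),          # AP = (1/1 + 2/3) / 2 ~ 0.83
           ([0, 1, 0, 0, 1], 2)]          # AP = (1/2 + 2/5) / 2 = 0.45
print(mean_average_precision(queries))    # ~0.64
```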