Algoritmi per IR Ranking. The big fight: find the best ranking


Page 1: Algoritmi per IR Ranking. The big fight: find the best ranking

Algoritmi per IR

Ranking

Page 2: Algoritmi per IR Ranking. The big fight: find the best ranking

The big fight: find the best ranking...

Page 3: Algoritmi per IR Ranking. The big fight: find the best ranking

Ranking: Google vs Google.cn

Page 4: Algoritmi per IR Ranking. The big fight: find the best ranking

Algoritmi per IR

Text-based Ranking (1st generation)

Page 5: Algoritmi per IR Ranking. The big fight: find the best ranking

Similarity between binary vectors

Documents are binary vectors X, Y ∈ {0,1}^D

Score: overlap measure |X ∩ Y|

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               1                   1              0           0        0         1
Brutus               1                   1              0           1        0         0
Caesar               1                   1              0           1        1         1
Calpurnia            0                   1              0           0        0         0
Cleopatra            1                   0              0           0        0         0
mercy                1                   0              1           1        1         1
worser               1                   0              1           1        1         0

What’s wrong with this score?

Page 6: Algoritmi per IR Ranking. The big fight: find the best ranking

Normalization

Dice coefficient (wrt avg #terms): 2|X ∩ Y| / (|X| + |Y|)  →  NO, the triangle inequality does not hold

Jaccard coefficient (wrt possible terms): |X ∩ Y| / |X ∪ Y|  →  OK, the triangle inequality holds (1 − Jaccard is a metric)
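To make these scores concrete, here is a minimal sketch (not from the slides) computing overlap, Dice and Jaccard over documents represented as term sets; the two toy documents mirror the first two columns of the incidence table above.

```python
def overlap(x, y):
    """Overlap measure: |X ∩ Y|."""
    return len(x & y)

def dice(x, y):
    """Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Jaccard coefficient: |X ∩ Y| / |X ∪ Y|."""
    return len(x & y) / len(x | y)

# Toy documents as sets of terms (binary vectors over the vocabulary).
doc_x = {"antony", "brutus", "caesar", "cleopatra", "mercy", "worser"}  # Antony and Cleopatra
doc_y = {"antony", "brutus", "caesar", "calpurnia"}                     # Julius Caesar

print(overlap(doc_x, doc_y))   # 3
print(dice(doc_x, doc_y))      # 0.6
print(jaccard(doc_x, doc_y))   # ≈ 0.43
```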

Page 7: Algoritmi per IR Ranking. The big fight: find the best ranking

What’s wrong in doc-similarity ?

Overlap matching doesn’t consider:

Term frequency in a document: if a doc talks more of t, then t should be weighted more.

Term scarcity in the collection: of is far commoner than baby or bed, so rare terms should count more.

Length of documents: scores should be length-normalized.

Page 8: Algoritmi per IR Ranking. The big fight: find the best ranking

A famous “weight”: tf-idf

w_{t,d} = tf_{t,d} · log(n / n_t)

where tf_{t,d} = frequency of term t in doc d = #occ_t / |d|,
n_t = #docs containing term t, n = #docs in the indexed collection,
and idf_t = log(n / n_t).

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony              13.1                11.4            0.0         0.0      0.0       0.0
Brutus               3.0                 8.3            0.0         1.0      0.0       0.0
Caesar               2.3                 2.3            0.0         0.5      0.3       0.3
Calpurnia            0.0                11.2            0.0         0.0      0.0       0.0
Cleopatra           17.7                 0.0            0.0         0.0      0.0       0.0
mercy                0.5                 0.0            0.7         0.9      0.9       0.3
worser               1.2                 0.0            0.6         0.6      0.6       0.0

Vector Space model
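A small sketch of this weighting scheme, with an invented three-document toy corpus; it follows the slide’s definitions (tf normalized by document length, idf = log(n/n_t)) and is only an illustration.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: w_td} dict per doc,
    with w_td = (#occ_t / |d|) * log(n / n_t), as defined on the slide."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # n_t per term
    weights = []
    for d in docs:
        counts = Counter(d)
        weights.append({t: (c / len(d)) * math.log(n / df[t])
                        for t, c in counts.items()})
    return weights

docs = [
    "antony brutus caesar cleopatra mercy worser".split(),
    "antony brutus caesar calpurnia".split(),
    "mercy worser".split(),
]
for w in tf_idf_weights(docs):
    print(w)
```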

Page 9: Algoritmi per IR Ranking. The big fight: find the best ranking

Why distance is a bad idea

Sec. 6.3

Page 10: Algoritmi per IR Ranking. The big fight: find the best ranking

A graphical example

Postulate: documents that are “close together” in the vector space talk about the same things. But Euclidean distance is sensitive to vector length!

[Figure: documents d1–d5 plotted as vectors in the space spanned by terms t1, t2, t3]

cos(α) = v · w / (||v|| · ||w||), where α is the angle between v and w

The user query is a very short doc

Easy to spam

Sophisticated algos to find the top-k docs for a query Q

Page 11: Algoritmi per IR Ranking. The big fight: find the best ranking

cosine(query,document)

cos(q, d) = (q · d) / (||q|| · ||d||) = Σ_{i=1..D} q_i d_i / ( √(Σ_{i=1..D} q_i²) · √(Σ_{i=1..D} d_i²) )

The numerator q · d is the dot product of the two vectors.

q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.

cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Sec. 6.3
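A minimal sketch of the cosine above, assuming query and document are sparse {term: tf-idf weight} dictionaries (a representation chosen here for illustration, not prescribed by the slides):

```python
import math

def cosine(q, d):
    """cos(q, d) = sum_i q_i * d_i / (||q|| * ||d||) for sparse vectors
    represented as {term: weight} dictionaries."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Toy example: a two-term query against a document vector.
q = {"caesar": 1.2, "brutus": 0.8}
d = {"caesar": 2.3, "brutus": 3.0, "antony": 13.1}
print(cosine(q, d))
```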

Page 12: Algoritmi per IR Ranking. The big fight: find the best ranking

Cos for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

for q, d length-normalized.

cos(q, d) = q · d = Σ_{i=1..D} q_i d_i

Page 13: Algoritmi per IR Ranking. The big fight: find the best ranking

Cosine similarity amongst 3 documents

term SaS PaP WH

affection 115 58 20

jealous 10 7 11

gossip 2 0 6

wuthering 0 0 38

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts)

Sec. 6.3

Note: To simplify this example, we don’t do idf weighting.

w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
w_{t,d} = 0                     otherwise

Page 14: Algoritmi per IR Ranking. The big fight: find the best ranking

3 documents example contd.

Log frequency weighting

term SaS PaP WH

affection 3.06 2.76 2.30

jealous 2.00 1.85 2.04

gossip 1.30 0 1.78

wuthering 0 0 2.58

After length normalization

term SaS PaP WH

affection 0.789 0.832 0.524

jealous 0.515 0.555 0.465

gossip 0.335 0 0.405

wuthering 0 0 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?

Sec. 6.3
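The numbers above can be reproduced with a short script (my own reconstruction of the computation, starting from the raw counts of the previous slide):

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weight(tf):
    # w_td = 1 + log10(tf) if tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

vecs = {doc: normalize({t: log_weight(tf) for t, tf in c.items()})
        for doc, c in counts.items()}

def cos(a, b):
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))  # ≈ 0.94
print(round(cos("SaS", "WH"), 2))   # ≈ 0.79
print(round(cos("PaP", "WH"), 2))   # ≈ 0.69
```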

Page 15: Algoritmi per IR Ranking. The big fight: find the best ranking

Vector spaces and other operators

Vector space OK for bag-of-words queries

Clean metaphor for similar-document queries

Not a good combination with operators: Boolean, wild-card, positional, proximity

First generation of search engines: invented before the “spamming” of web search

Page 16: Algoritmi per IR Ranking. The big fight: find the best ranking

Algoritmi per IR

Top-k retrieval

Page 17: Algoritmi per IR Ranking. The big fight: find the best ranking

Speed-up top-k retrieval

The costly part is the computation of the cosine scores.

Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K.

Return the top K docs in A, according to the score.

The same approach is also used for other (non-cosine) scoring functions.

We will look at several schemes following this approach.

Sec. 7.1.1

Page 18: Algoritmi per IR Ranking. The big fight: find the best ranking

Index elimination

Consider only docs containing at least one query term. Hence this means…

Take this further:
1. Only consider high-idf query terms
2. Only consider docs containing many query terms

Sec. 7.1.2

Page 19: Algoritmi per IR Ranking. The big fight: find the best ranking

High-idf query terms only

For a query such as catcher in the rye, only accumulate scores from catcher and rye.

Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much.

Benefit: postings of low-idf terms have many docs, and these (many) docs get eliminated from the set A of contenders.

Sec. 7.1.2

Page 20: Algoritmi per IR Ranking. The big fight: find the best ranking

Docs containing many query terms

For multi-term queries, compute scores only for docs containing several of the query terms, say at least 3 out of 4.

This imposes a “soft conjunction” on queries, as seen on web search engines (early Google).

Easy to implement in postings traversal

Sec. 7.1.2

Page 21: Algoritmi per IR Ranking. The big fight: find the best ranking

3 of 4 query terms

Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia: 13 → 16 → 32

Scores only computed for docs 8, 16 and 32.

Sec. 7.1.2
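A sketch of the soft-conjunction filter on the postings above (an illustrative reconstruction, not course code): count, for every docID, in how many of the four lists it appears and keep those reaching the threshold.

```python
from collections import Counter

postings = {
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}

def candidates(postings, min_terms=3):
    """Return docIDs contained in at least min_terms of the query terms' lists."""
    hits = Counter(doc for plist in postings.values() for doc in plist)
    return sorted(doc for doc, k in hits.items() if k >= min_terms)

print(candidates(postings))  # [8, 16, 32] -> only these get scored
```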

Page 22: Algoritmi per IR Ranking. The big fight: find the best ranking

Champion Lists

Preprocess: assign to each term its m best documents (its champion list).

Search: if Q has q terms, merge their preferred lists (≤ mq candidate answers), compute the cos between Q and these docs, and choose the top k. Empirically, m must be picked with m > k to work well.

Note: today search engines use tf-idf PLUS PageRank (PLUS other weights).

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony              13.1                11.4            0.0         0.0      0.0       0.0
Brutus               3.0                 8.3            0.0         1.0      0.0       0.0
Caesar               2.3                 2.3            0.0         0.5      0.3       0.3
Calpurnia            0.0                11.2            0.0         0.0      0.0       0.0
Cleopatra           17.7                 0.0            0.0         0.0      0.0       0.0
mercy                0.5                 0.0            0.7         0.9      0.9       0.3
worser               1.2                 0.0            0.6         0.6      0.6       0.0
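A compact sketch of champion lists under an assumed data layout ({term: {doc: weight}}); score() stands for whatever full scorer is used at query time (e.g. the cosine above) and is a placeholder, not something named in the slides.

```python
import heapq

def build_champion_lists(weights, m):
    """weights: {term: {doc: tf-idf weight}}.
    Keep, for each term, the m docs with highest weight."""
    return {t: set(doc for doc, _ in
                   heapq.nlargest(m, dw.items(), key=lambda kv: kv[1]))
            for t, dw in weights.items()}

def top_k(query_terms, champions, score, k):
    """Merge the preferred lists of the query terms (<= m*q docs)
    and rank only those docs with the full scoring function."""
    pool = set().union(*(champions.get(t, set()) for t in query_terms))
    return heapq.nlargest(k, pool, key=lambda doc: score(query_terms, doc))
```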

Page 23: Algoritmi per IR Ranking. The big fight: find the best ranking

Complex scores

Consider a simple total score combining cosine relevance and authority:

net-score(q,d) = PR(d) + cosine(q,d)

One can use some other linear combination than an equal weighting.

Now we seek the top K docs by net score.

Sec. 7.1.4

Page 24: Algoritmi per IR Ranking. The big fight: find the best ranking

Advanced: Fancy-hits heuristic

Preprocess:
Assign docIDs by decreasing PR weight.
Define FH(t) = the m docs for t with highest tf-idf weight.
Define IL(t) = the rest (i.e. increasing docID = decreasing PR weight).
Idea: a document that scores high should be in FH or in the front of IL.

Search for a t-term query:
First FH: take the docs common to their FH lists, compute the score of these docs, and keep the top-k docs.
Then IL: scan the ILs and check the common docs; compute their score and possibly insert them into the top-k. Stop when M docs have been checked or the PR score becomes smaller than some threshold.
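A rough sketch of the two phases; the data layout (FH as sets of docIDs, IL as docID-sorted lists) and the helpers score() and pr() are my own assumptions for illustration, not the original implementation.

```python
import heapq

def fancy_hits_search(query_terms, FH, IL, score, pr, k, M, pr_threshold):
    """FH[t]: set of the m docs with highest tf-idf for term t.
    IL[t]: the remaining docs for t, sorted by increasing docID
           (= decreasing PageRank, by construction).
    score(d): full query-document score; pr(d): PageRank of d."""
    # Phase 1: docs common to all FH lists, keep the best k.
    common_fh = set.intersection(*(FH[t] for t in query_terms))
    top = heapq.nlargest(k, ((score(d), d) for d in common_fh))
    # Phase 2: scan the ILs front-to-back and check the common docs.
    common_il = sorted(set.intersection(*(set(IL[t]) for t in query_terms)))
    for checked, d in enumerate(common_il, start=1):
        if checked > M or pr(d) < pr_threshold:
            break
        top = heapq.nlargest(k, top + [(score(d), d)])
    return [d for _, d in top]
```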

Page 25: Algoritmi per IR Ranking. The big fight: find the best ranking

Algoritmi per IR

Speed-up querying by clustering

Page 26: Algoritmi per IR Ranking. The big fight: find the best ranking

Visualization

[Figure: a query point, the leaders, and the followers attached to each leader in the vector space]

Sec. 7.1.6

Page 27: Algoritmi per IR Ranking. The big fight: find the best ranking

Cluster pruning: preprocessing

Pick √N docs at random: call these leaders.

For every other doc, pre-compute its nearest leader. The docs attached to a leader are its followers.

Likely: each leader has ~ √N followers.

Sec. 7.1.6

Page 28: Algoritmi per IR Ranking. The big fight: find the best ranking

Cluster pruning: query processing

Process a query as follows:

Given query Q, find its nearest leader L.

Seek K nearest docs from among L’s followers.

Sec. 7.1.6
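A compact sketch of both phases, assuming a generic similarity function sim(a, b) (e.g. the cosine above) and the √N sampling rule from the preprocessing slide.

```python
import heapq, math, random

def preprocess(docs, sim):
    """Pick ~sqrt(N) random docs as leaders; attach every doc to its
    nearest leader (its 'followers'). Works with doc indices."""
    n = len(docs)
    leader_ids = random.sample(range(n), max(1, round(math.sqrt(n))))
    followers = {l: [] for l in leader_ids}
    for i, d in enumerate(docs):
        nearest = max(leader_ids, key=lambda l: sim(d, docs[l]))
        followers[nearest].append(i)
    return leader_ids, followers

def query(q, docs, leader_ids, followers, sim, k):
    """Find the query's nearest leader L, then the k nearest docs
    among L's followers only."""
    L = max(leader_ids, key=lambda l: sim(q, docs[l]))
    return heapq.nlargest(k, followers[L], key=lambda i: sim(q, docs[i]))
```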

Page 29: Algoritmi per IR Ranking. The big fight: find the best ranking

Why use random sampling

Fast

Leaders reflect the data distribution

Sec. 7.1.6

Page 30: Algoritmi per IR Ranking. The big fight: find the best ranking

General variants

Have each follower attached to b1=3 (say) nearest leaders.

From query, find b2=4 (say) nearest leaders and their followers.

Can recur on leader/follower construction.

Sec. 7.1.6

Page 31: Algoritmi per IR Ranking. The big fight: find the best ranking

Algoritmi per IR

Relevance feedback

Page 32: Algoritmi per IR Ranking. The big fight: find the best ranking

Relevance Feedback

Relevance feedback: user feedback on the relevance of docs in an initial set of results.

The user issues a (short, simple) query.
The user marks some results as relevant or non-relevant.
The system computes a better representation of the information need based on the feedback.
Relevance feedback can go through one or more iterations.

Sec. 9.1

Page 33: Algoritmi per IR Ranking. The big fight: find the best ranking

Rocchio’s Algorithm

The Rocchio algorithm uses the vector space model to pick a relevance-feedback query.

Rocchio seeks the query q_opt that maximizes

q_opt = argmax_q [ cos(q, μ(D_r)) − cos(q, μ(D_nr)) ]

where μ(D_r) and μ(D_nr) are the centroids of the relevant and non-relevant documents.

Sec. 9.1.1

Page 34: Algoritmi per IR Ranking. The big fight: find the best ranking

Rocchio (SMART)

Used in practice:

D_r = set of known relevant doc vectors
D_nr = set of known irrelevant doc vectors
q_m = modified query vector; q_0 = original query vector
α, β, γ: weights (hand-chosen or set empirically)

The new query moves toward relevant documents and away from irrelevant documents:

q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

Sec. 9.1.1
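A minimal NumPy sketch of this update; the default weights below are common textbook choices, not values prescribed by the slide, and clipping negative components to zero is an optional convention.

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr).
    q0: (V,) tf-idf array; relevant/nonrelevant: lists of (V,) arrays."""
    qm = alpha * q0
    if relevant:
        qm = qm + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    # Negative term weights are usually clipped to zero.
    return np.maximum(qm, 0.0)
```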

Page 35: Algoritmi per IR Ranking. The big fight: find the best ranking

Relevance Feedback: Problems

Users are often reluctant to provide explicit feedback

It’s often harder to understand why a particular document was retrieved after applying relevance feedback

There is no clear evidence that relevance feedback is the “best use” of the user’s time.

Page 36: Algoritmi per IR Ranking. The big fight: find the best ranking

Relevance Feedback on the Web

Some search engines offer a similar/related-pages feature (this is a trivial form of relevance feedback):
Google (link-based), Altavista, Stanford WebBase

Some don’t, because it’s hard to explain to users (α/β/γ ??):
Alltheweb, Bing, Yahoo

Excite initially had true relevance feedback, but abandoned it due to lack of use.

Sec. 9.1.4

Page 37: Algoritmi per IR Ranking. The big fight: find the best ranking

Pseudo relevance feedback

Pseudo-relevance feedback automates the “manual” part of true relevance feedback:
Retrieve a list of hits for the user’s query.
Assume that the top k are relevant.
Do relevance feedback (e.g., Rocchio).

Works very well on average, but can go horribly wrong for some queries.
Several iterations can cause query drift.

Sec. 9.1.6
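The loop itself is short; this sketch reuses the rocchio() helper from the earlier sketch and assumes a hypothetical search(query_vector, k) call that returns the vectors of the top-k hits.

```python
def pseudo_relevance_feedback(q0, search, k=10, iterations=1):
    """Run the query, assume the top-k hits are relevant, and feed them
    back through Rocchio. Several iterations may cause query drift."""
    q = q0
    for _ in range(iterations):
        top_docs = search(q, k)                      # hypothetical retrieval call
        q = rocchio(q, relevant=top_docs, nonrelevant=[])
    return q
```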

Page 38: Algoritmi per IR Ranking. The big fight: find the best ranking

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight the terms in the query.

In query expansion, users give additional input (good/bad search term) on words or phrases

Sec. 9.2.2

Page 39: Algoritmi per IR Ranking. The big fight: find the best ranking

How to augment the user query?

Manual thesaurus (costly to generate): e.g. MedLine: physician, syn: doc, doctor, MD

Global Analysis (static; all docs in collection):
  Automatically derived thesaurus (co-occurrence statistics)
  Refinements based on query-log mining (common on the web)

Local Analysis (dynamic): analysis of documents in the result set

Sec. 9.2.2

Page 40: Algoritmi per IR Ranking. The big fight: find the best ranking

Query assist

Would you expect such a feature to increase the query volume at a search engine?

Page 41: Algoritmi per IR Ranking. The big fight: find the best ranking

Algoritmi per IR

Zone indexes

Page 42: Algoritmi per IR Ranking. The big fight: find the best ranking

Parametric and zone indexes

Thus far, a doc has been a term sequence

But documents have multiple parts: Author, Title, Date of publication, Language, Format, etc.

These are the metadata about a document

Sec. 6.1

Page 43: Algoritmi per IR Ranking. The big fight: find the best ranking

Zone

A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, …

Build inverted indexes on fields AND zones to permit querying

E.g., “find docs with merchant in the title zone and matching the query gentle rain”

Sec. 6.1

Page 44: Algoritmi per IR Ranking. The big fight: find the best ranking

Example zone indexes

Encode zones in dictionary vs. postings.

Sec. 6.1

Page 45: Algoritmi per IR Ranking. The big fight: find the best ranking

Tiered indexes

Break postings up into a hierarchy of lists: most important … least important.

The inverted index is thus broken up into tiers of decreasing importance.

At query time, use the top tier unless it fails to yield K docs; if so, drop to lower tiers.

Sec. 7.2.1

Page 46: Algoritmi per IR Ranking. The big fight: find the best ranking

Example tiered index

Sec. 7.2.1

Page 47: Algoritmi per IR Ranking. The big fight: find the best ranking

Query term proximity

Free text queries: just a set of terms typed into the query box – common on the web

Users prefer docs in which query terms occur within close proximity of each other

Would like scoring function to take this into account – how?

Sec. 7.2.2

Page 48: Algoritmi per IR Ranking. The big fight: find the best ranking

Query parsers

E.g. query: rising interest rates

Run the query as a phrase query.

If < K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates.

If we still have < K docs, run the vector space query rising interest rates.

Rank matching docs by vector space scoring.

Sec. 7.2.3
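A sketch of this cascade, with phrase_search() and vector_search() as hypothetical back-end calls returning sets of docIDs (the slides do not name such functions):

```python
def parse_and_run(query, k, phrase_search, vector_search):
    """Try the full phrase, then the shorter phrases, then fall back to
    the vector space query; stop as soon as >= k docs are found."""
    results = phrase_search(query)                    # e.g. "rising interest rates"
    if len(results) < k:
        terms = query.split()
        for phrase in (" ".join(p) for p in zip(terms, terms[1:])):
            results |= phrase_search(phrase)          # "rising interest", "interest rates"
    if len(results) < k:
        results |= vector_search(query)               # bag-of-words fallback
    return results                                    # then rank by vector space scoring
```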

Page 49: Algoritmi per IR Ranking. The big fight: find the best ranking

Algoritmi per IR

Quality of a Search Engine

Page 50: Algoritmi per IR Ranking. The big fight: find the best ranking

Is it good ?

How fast does it index? Number of documents/hour (average document size)

How fast does it search? Latency as a function of index size

Expressiveness of the query language

Page 51: Algoritmi per IR Ranking. The big fight: find the best ranking

Measures for a search engine

All of the preceding criteria are measurable

The key measure: user happiness…useless answers won’t make a user happy

Page 52: Algoritmi per IR Ranking. The big fight: find the best ranking

Happiness: elusive to measure

The commonest approach is given by the relevance of search results. How do we measure it?

It requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair

Page 53: Algoritmi per IR Ranking. The big fight: find the best ranking

Evaluating an IR system

Standard benchmarks:
TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years.
Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant.

On the Web everything is more complicated, since we cannot mark the entire corpus!

Page 54: Algoritmi per IR Ranking. The big fight: find the best ranking

General scenario

[Figure: Venn diagram of the Relevant and Retrieved sets within the whole collection]

Page 55: Algoritmi per IR Ranking. The big fight: find the best ranking

Precision vs. Recall

Precision: % of retrieved docs that are relevant [issue: “junk” found]
Recall: % of relevant docs that are retrieved [issue: “info” found]

[Figure: Venn diagram of the Relevant and Retrieved sets within the collection]

Page 56: Algoritmi per IR Ranking. The big fight: find the best ranking

How to compute them

Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)

                 Relevant               Not Relevant
Retrieved        tp (true positive)     fp (false positive)
Not Retrieved    fn (false negative)    tn (true negative)
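A quick worked check of the two formulas, with invented numbers:

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of docIDs."""
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# e.g. 8 docs retrieved, 6 of them relevant, 10 relevant docs overall
retrieved = set(range(1, 9))
relevant = {1, 2, 3, 4, 5, 6, 20, 21, 22, 23}
print(precision_recall(retrieved, relevant))  # (0.75, 0.6)
```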

Page 57: Algoritmi per IR Ranking. The big fight: find the best ranking

Some considerations

Can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved

Precision usually decreases

Page 58: Algoritmi per IR Ranking. The big fight: find the best ranking

Precision-Recall curve

We measure precision at various levels of recall. Note: it is an AVERAGE over many queries.

[Plot: precision (y-axis) vs. recall (x-axis), with the measured points marked]

Page 59: Algoritmi per IR Ranking. The big fight: find the best ranking

A common picture

[Plot: the typical shape of a precision-recall curve: precision (y-axis) decreasing as recall (x-axis) grows]

Page 60: Algoritmi per IR Ranking. The big fight: find the best ranking

F measure

Combined measure (weighted harmonic mean):

1/F = α (1/P) + (1 − α) (1/R)

People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R), i.e. F1 = 2PR / (P + R).

Use this if you need to optimize a single measure that balances precision and recall.
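And the corresponding one-liner (illustrative only), checked against the precision/recall values used above:

```python
def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean: 1/F = alpha/P + (1 - alpha)/R."""
    return 1.0 / (alpha / p + (1 - alpha) / r) if p > 0 and r > 0 else 0.0

print(f_measure(0.75, 0.6))  # ≈ 0.667 = 2PR/(P+R) for the balanced F1
```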

Page 61: Algoritmi per IR Ranking. The big fight: find the best ranking

Information Retrieval

Recommendation Systems

Page 62: Algoritmi per IR Ranking. The big fight: find the best ranking

Recommendations

We have a list of restaurants, with ratings from some users for some of them.

Which restaurant(s) should I recommend to Dave?

[Table: Yes/No ratings by Alice, Bob, Cindy, Dave, Estie and Fred for the restaurants Brahma Bull, Spaghetti House, Mango, Il Fornaio, Zao, Ming's, Ramona's, Straits, Homma's; only some cells are filled]

Page 63: Algoritmi per IR Ranking. The big fight: find the best ranking

Basic Algorithm

Recommend the most popular restaurants, say by # positive votes minus # negative votes.

What if Dave does not like Spaghetti?

[Table: the same user × restaurant ratings encoded as +1/−1]

Page 64: Algoritmi per IR Ranking. The big fight: find the best ranking

Smart Algorithm

Basic idea: find the person “most similar” to Dave according to cosine-similarity (i.e. Estie), and then recommend something this person likes.

Perhaps recommend Straits Cafe to Dave

[Table: the same +1/−1 ratings matrix]

Do you want to rely on one person’s opinions?
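A sketch of this “smart” step under an assumed data layout ({user: {restaurant: ±1}}); the ratings below are placeholders, since the exact matrix on the slide did not survive extraction.

```python
import math

def cos_users(a, b):
    """Cosine similarity between two sparse ±1 rating vectors."""
    common = set(a) & set(b)
    dot = sum(a[x] * b[x] for x in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(target, ratings):
    """Find the user most similar to `target` (cosine over +1/-1 votes)
    and suggest the items that user liked and `target` has not rated."""
    others = {u: r for u, r in ratings.items() if u != target}
    best = max(others, key=lambda u: cos_users(ratings[target], others[u]))
    return [item for item, v in ratings[best].items()
            if v > 0 and item not in ratings[target]]

# Placeholder ratings, only for illustration.
ratings = {
    "Dave":  {"Brahma Bull": -1, "Zao": 1, "Ming's": 1},
    "Estie": {"Brahma Bull": -1, "Zao": 1, "Straits": 1},
    "Fred":  {"Zao": -1, "Ming's": -1},
}
print(recommend("Dave", ratings))  # ['Straits']
```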