Transcript
Page 1

(Some issues in) Text Ranking

Page 2

Recall General Framework

• Crawl
  – Use XML structure
  – Follow links to get new pages

• Retrieve relevant documents – Today

• Rank
  – PageRank, HITS
  – Rank Aggregation

Page 3

Relevant documents

• Usually: relevant with respect to a keyword, a set of keywords, a logical expression, …

• Closely related to ranking
  – "How relevant" it is can be considered another measure

• Usually done as a separate step
  – Recall the online vs. offline issue

• But some techniques are reusable

Page 4

Defining Relevant Documents

• Common strategy: treat text documents as a "bag of words" (BOW)
  – Denote BOW(D) for a document D
  – Bag rather than set (i.e., multiplicity is kept)
  – Words are typically stemmed, i.e., reduced to root form
  – Loses structure, but simplifies life

• Simple definition:
  – A document D is relevant to a keyword W if W is in BOW(D)

Page 5

Cont.

• Simple variant
  – The level of relevance of D to W is the multiplicity of W in BOW(D)
  – Problem: bias towards long documents
  – So divide by the document length |BOW(D)|
  – This is called term frequency (TF)
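A minimal sketch of TF under these definitions (the function and variable names are mine, not from the slides):

```python
from collections import Counter

def term_frequency(word, document_tokens):
    """TF(W, D): multiplicity of W in BOW(D), divided by |BOW(D)|."""
    bow = Counter(document_tokens)           # bag of words: token -> multiplicity
    return bow[word] / len(document_tokens)  # normalize by document length

# Example: "dogs" appears once in a 5-word document -> TF = 0.2
print(term_frequency("dogs", ["he", "dogs", "me", "every", "day"]))
```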

Page 6

A different angle

• Given a document D, what are the “most important” words in D?

• Clearly high term frequency should be considered

• Rank terms according to TF?

Page 7

Ranking according to TF

A          2022
Is         1023
He          350
...
Liverpool    25
Beatles      12

Page 8

IDF

• Observation: if w is rare in the document set, but appears many times in a document D, then w is "important" for D

• IDF(w) = log(|Docs| / |Docs'|)
  – Docs is the set of all documents in the corpus, Docs' is the subset of documents that contain w

• TFIDF(D,W) = TF(W,D) * IDF(W)
  – "Correlation" of D and W
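Putting the two formulas together, a small sketch (helper names are mine; it assumes every queried word occurs in at least one document, otherwise |Docs'| would be 0):

```python
import math
from collections import Counter

def idf(word, docs):
    """IDF(w) = log(|Docs| / |Docs'|), where Docs' = docs containing w."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf(word, doc):
    return Counter(doc)[word] / len(doc)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [["a", "cat"], ["a", "dog"], ["a", "dog", "barks"]]
print(tfidf("dog", docs[1], docs))  # rarer term: positive score
print(tfidf("a", docs[1], docs))    # appears everywhere: IDF = log(1) = 0
```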

Page 9

Inverted Index

• For every term we keep a list of all documents in which it appears

• The list is sorted by TFIDF scores

• Scores are also kept

• Given a keyword, it is then easy to return the top-k documents
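A toy sketch of such an index, assuming the TFIDF scores were already computed offline (the data layout and names are my choice):

```python
from collections import defaultdict

def build_inverted_index(tfidf_scores):
    """tfidf_scores: dict mapping (doc_id, term) -> TFIDF score."""
    index = defaultdict(list)
    for (doc_id, term), score in tfidf_scores.items():
        index[term].append((score, doc_id))
    for postings in index.values():
        postings.sort(reverse=True)     # each list sorted by TFIDF, best first
    return index

def top_k(index, term, k):
    return index.get(term, [])[:k]      # first k entries of the sorted list

index = build_inverted_index({(1, "dog"): 0.5, (2, "dog"): 0.9, (3, "cat"): 0.7})
print(top_k(index, "dog", 1))           # -> [(0.9, 2)]
```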

Page 10

Ranking

• Now assume that these documents are web pages

• How do we return the most relevant?

• How do we combine with other rankings? (e.g. PR?)

• How do we answer boolean queries?
  – E.g., X1 AND (X2 OR X3), as in the sketch below
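One way to evaluate such a query is to treat each keyword's posting list as a set of document ids and combine the sets with intersection and union (a sketch; the helper names and the toy index are mine):

```python
def docs_for(index, term):
    """Document ids in the posting list of a term (entries are (score, doc_id))."""
    return {doc_id for _, doc_id in index.get(term, [])}

def query_x1_and_x2_or_x3(index, x1, x2, x3):
    # X1 AND (X2 OR X3) as set operations
    return docs_for(index, x1) & (docs_for(index, x2) | docs_for(index, x3))

index = {"x1": [(0.9, 1), (0.4, 2)], "x2": [(0.8, 2)], "x3": [(0.7, 3)]}
print(query_x1_and_x2_or_x3(index, "x1", "x2", "x3"))  # -> {2}
```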

Page 11

Rank Aggregation

• To combine TFIDF, PageRank, …

• To combine TFIDF with respect to different keywords
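The slides leave the aggregation method open; one simple choice is a weighted sum of the per-document scores (a sketch; the weights, names, and toy numbers are mine):

```python
def aggregate(score_lists, weights):
    """Combine several per-document score dicts into one ranking by
    weighted sum; choosing the weights is a modeling decision."""
    combined = {}
    for scores, w in zip(score_lists, weights):
        for doc, s in scores.items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    return sorted(combined, key=combined.get, reverse=True)

tfidf_scores = {"p1": 0.9, "p2": 0.3}
pagerank     = {"p1": 0.1, "p2": 0.8}
print(aggregate([tfidf_scores, pagerank], weights=[0.5, 0.5]))  # -> ['p2', 'p1']
```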

Page 12

Part-of-Speech Tagging

• So far we have considered documents only as bags-of-words

• Computationally efficient, easy to program, BUT

• We lost the structure that may be very important:
  – E.g., perhaps we are interested (more) in documents for which W is often the sentence subject?

• Part-of-speech tagging
  – Useful for ranking
  – For machine translation
  – Word-sense disambiguation
  – …

Page 13

Part-of-Speech Tagging

• Tag this word. This word is a tag.

• He dogs like a flea

• The can is in the fridge

• The sailor dogs me every day

Page 14

A Learning Problem

• Training set: tagged corpus
  – Most famous is the Brown Corpus, with about 1M words
  – The goal is to learn a model from the training set, and then perform tagging of untagged text
  – Performance is tested on a test set

Page 15

Simple Algorithm

• Assign to each word its most popular tag in the training set

• Problem: Ignores context

• "Dogs" and "tag" will always be tagged as nouns…

• "Can" will be tagged as a verb

• Still, achieves around 80% correctness for real-life test sets
  – Goes up to as high as 90% when combined with some simple rules
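A minimal version of this baseline (the fallback tag for unseen words is my choice, not from the slides):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """tagged_corpus: list of (word, tag) pairs from the training set."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word, keep its single most popular tag
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, most_popular, default="NN"):
    # Unknown words fall back to a default tag (a common convention)
    return [(w, most_popular.get(w, default)) for w in words]

model = train_baseline([("the", "DT"), ("can", "MD"), ("can", "MD"), ("can", "NN")])
print(tag(["the", "can"], model))  # 'can' is always tagged 'MD', ignoring context
```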

Page 16

(HMM) Hidden Markov Model

• Model: sentences are generated by a probabilistic process

• In particular, a Markov chain whose states correspond to parts of speech

• Transitions are probabilistic

• In each state a word is emitted
  – The output word is again chosen probabilistically, based on the state

Page 17

HMM

• An HMM is:
  – A set of N states
  – A set of M symbols (words)
  – An N×N matrix of transition probabilities Ptrans
  – A vector of size N of initial state probabilities Pstart
  – An N×M matrix of emission probabilities Pout

• "Hidden" because we see only the outputs, not the sequence of states traversed
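As a concrete illustration, the five components above can be encoded directly as arrays (a toy sketch; the sizes and all numbers are made up):

```python
import numpy as np

N, M = 2, 3                            # N states (tags), M symbols (words)
Ptrans = np.array([[0.7, 0.3],         # N x N transition probabilities
                   [0.4, 0.6]])
Pstart = np.array([0.6, 0.4])          # initial state probabilities (size N)
Pout   = np.array([[0.5, 0.4, 0.1],    # N x M emission probabilities
                   [0.1, 0.3, 0.6]])
# Each row of Ptrans and Pout sums to 1, as does Pstart
```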

Page 18

Example

Page 19

3 Fundamental Problems

1) Compute the probability of a given observation sequence (= sentence)
2) Given an observation sequence, find the most likely hidden state sequence
   – This is tagging
3) Given a training set, find the model that would make the observations most likely

Page 20

Tagging

• Find the most likely sequence of states that led to an observed output sequence

• Problem: exponentially many possible sequences!

Page 22

Viterbi Algorithm

• Dynamic programming

• Vt,k is the probability of the most probable state sequence
  – generating the first t+1 observations (X0, …, Xt)
  – and terminating at state k

• V0,k = Pstart(k) * Pout(k, X0)

• Vt,k = Pout(k, Xt) * max_k' { Vt-1,k' * Ptrans(k', k) }

Page 23

Finding the path

• Note that we are interested in the most likely path, not only in its probability

• So we need to keep track, at each point, of the argmax
  – Combine them to form a sequence (see the sketch below)

• What about top-k?
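A minimal sketch of the full algorithm, combining the recurrence from the previous page with the argmax bookkeeping (the function name and the toy model are mine):

```python
import numpy as np

def viterbi(X, Pstart, Ptrans, Pout):
    """Most likely state sequence for observations X (indices into Pout's columns).
    Implements V[0,k] = Pstart(k)*Pout(k,X0) and
    V[t,k] = Pout(k,Xt) * max_k' { V[t-1,k'] * Ptrans(k',k) }."""
    T, N = len(X), len(Pstart)
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)   # argmax bookkeeping for path recovery
    V[0] = Pstart * Pout[:, X[0]]
    for t in range(1, T):                # O(T * N^2) overall
        for k in range(N):
            scores = V[t - 1] * Ptrans[:, k]
            back[t, k] = np.argmax(scores)
            V[t, k] = Pout[k, X[t]] * scores[back[t, k]]
    # Follow the stored argmaxes backwards to recover the state sequence
    path = [int(np.argmax(V[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy run with the matrices sketched earlier (all values are made up)
Ptrans = np.array([[0.7, 0.3], [0.4, 0.6]])
Pstart = np.array([0.6, 0.4])
Pout = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], Pstart, Ptrans, Pout))  # e.g. -> [0, 0, 1]
```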

Page 24

Complexity

• O(T*|S|^2)

• Where T is the sequence (=sentence) length, |S| is the number of states (= number of possible tags)

Page 25

Computing the probability of a sequence

• Forward probabilities:
  – αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k

• Backward probabilities:
  – βt(k) is the probability of seeing the sequence Xt+1…Xn given that the Markov process is at state k at time t

Page 26

Computing the probabilities

Forward algorithm:
  α0(k) = Pstart(k) * Pout(k, X0)
  αt(k) = Pout(k, Xt) * Σk' { αt-1(k') * Ptrans(k', k) }
  P(X0, …, Xn) = Σk αn(k)

Backward algorithm:
  βt(k) = P(Xt+1 … Xn | state at time t is k)
  βn(k) = 1 for all k
  βt(k) = Σk' { Ptrans(k, k') * Pout(k', Xt+1) * βt+1(k') }
  P(X0, …, Xn) = Σk Pstart(k) * Pout(k, X0) * β0(k)
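A direct implementation of both recursions (a sketch; the function names and toy model are mine). The final line checks that both directions yield the same sequence probability:

```python
import numpy as np

def forward(X, Pstart, Ptrans, Pout):
    """alpha[t, k] = P(X0..Xt, state at time t is k)."""
    T, N = len(X), len(Pstart)
    alpha = np.zeros((T, N))
    alpha[0] = Pstart * Pout[:, X[0]]
    for t in range(1, T):
        alpha[t] = Pout[:, X[t]] * (alpha[t - 1] @ Ptrans)
    return alpha

def backward(X, Ptrans, Pout):
    """beta[t, k] = P(X_{t+1}..Xn | state at time t is k)."""
    T, N = len(X), Ptrans.shape[0]
    beta = np.ones((T, N))                 # beta_n(k) = 1 for all k
    for t in range(T - 2, -1, -1):
        beta[t] = Ptrans @ (Pout[:, X[t + 1]] * beta[t + 1])
    return beta

Ptrans = np.array([[0.7, 0.3], [0.4, 0.6]])
Pstart = np.array([0.6, 0.4])
Pout = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
X = [0, 1, 2]
alpha, beta = forward(X, Pstart, Ptrans, Pout), backward(X, Ptrans, Pout)
# Both directions give the same P(X):
print(alpha[-1].sum(), (Pstart * Pout[:, X[0]] * beta[0]).sum())
```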

Page 27

Learning the HMM probabilities

• Expectation-Maximization algorithm:
  1. Start with initial probabilities
  2. Compute Eij, the expected number of transitions from i to j while generating a sequence, for each i, j (see next)
  3. Set the probability of transition from i to j to be Eij / (Σk Eik)
  4. Similarly for the emission probabilities
  5. Repeat 2-4 using the new model, until convergence
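The slides defer the computation of Eij to the next page; the standard forward-backward expected-count formula, using the α and β tables from the sketch above, looks like this (a sketch; the function name is mine):

```python
import numpy as np

def expected_transitions(X, Ptrans, Pout, alpha, beta):
    """E[i, j]: expected number of i -> j transitions while generating X
    (the standard forward-backward / Baum-Welch expectation)."""
    T, N = len(X), Ptrans.shape[0]
    p_obs = alpha[-1].sum()               # P(X) from the forward pass
    E = np.zeros((N, N))
    for t in range(T - 1):
        # P(state t = i, state t+1 = j | X), accumulated over t
        E += np.outer(alpha[t], Pout[:, X[t + 1]] * beta[t + 1]) * Ptrans / p_obs
    return E

# M-step for transitions (step 3 above): Ptrans[i, j] <- E[i, j] / sum_k E[i, k]
# new_Ptrans = E / E.sum(axis=1, keepdims=True)
```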

Page 28

Estimating the expectations

• By sampling
  – Re-run a random execution of the model, say 100 times
  – Count transitions

• By analysis
  – Use Bayes' rule on the formula for sequence probability
  – Called the Forward-Backward algorithm

Page 29

Accuracy

• Tested experimentally

• Exceeds 96% for the Brown corpus
  – Trained on half and tested on the other half

• Compare with the 80-90% achieved by the trivial algorithm

• The hard cases are few, but they are very hard…

Page 30

NLTK

• http://www.nltk.org/

• Natural Language Toolkit

• Open-source Python modules for NLP tasks
  – Including stemming, POS tagging, and much more (see the sketch below)
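A minimal session illustrating both tasks with NLTK's standard tokenizer, tagger, and Porter stemmer (the download names match recent NLTK releases and fetch the required data packages once):

```python
import nltk
nltk.download("punkt")                          # tokenizer model
nltk.download("averaged_perceptron_tagger")     # POS tagger model

from nltk.stem import PorterStemmer

tokens = nltk.word_tokenize("The sailor dogs me every day")
print(nltk.pos_tag(tokens))                     # [(word, tag), ...] pairs
print([PorterStemmer().stem(t) for t in tokens])  # stemming to root forms
```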

