Transcript

Slide 1

Beyond Boolean QueriesRanked retrievalThus far, our queries have all been Boolean.Documents either match or dont.Good for expert users with precise understanding of their needs and the collection.Also good for applications: Applications can easily consume 1000s of results.Writing the precise query that will produce the desired result is difficult for most users.

Problem with Boolean search:feast or famineBoolean queries often result in either too few (=0) or too many (1000s) results.A query that is too broad yields hundreds of thousands of hitsA query that is too narrow may yield no hitsIt takes a lot of skill to come up with a query that produces a manageable number of hits.AND gives too few; OR gives too manyCf. our discussion of how Westlaw Boolean queries didnt actually outperform free text querying

3Ranked retrieval modelsRather than a set of documents satisfying a query expression, in ranked retrieval models, the system returns an ordering over the (top) documents in the collection with respect to a queryFree text queries: Rather than a query language of operators and expressions, the users query is just one or more words in a human languageIn principle, these are different options, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa

4Feast or famine: not a problem in ranked retrievalWhen a system produces a ranked result set, large result sets are not an issueIndeed, the size of the result set is not an issueWe just show the top k ( 10) resultsWe dont overwhelm the user

Premise: the ranking algorithm worksScoring as the basis of ranked retrievalWe wish to return, in order, the documents most likely to be useful to the searcherHow can we rank-order the documents in the collection with respect to a query?Assign a score say in [0, 1] to each documentThis score measures how well document and query match.Query-document matching scoresWe need a way of assigning a score to a query/document pairLets start with a one-term queryIf the query term does not occur in the document: score should be 0The more frequent the query term in the document, the higher the score (should be)We will look at a number of alternatives for this.Take 1: Jaccard coefficientA commonly used measure of overlap of two sets A and Bjaccard(A,B) = |A B| / |A B|jaccard(A,A) = 1jaccard(A,B) = 0 if A B = 0A and B dont have to be the same size.Always assigns a number between 0 and 1.Jaccard coefficient: Scoring exampleWhat is the query-document match score that the Jaccard coefficient computes for each of the two documents below?Query: ides of marchDocument 1: caesar died in marchDocument 2: the long marchDoc 1: 1/6 (Using all words, no stop words ignored)Doc 2: 1/59Jaccard Example doneQuery: ides of marchDocument 1: caesar died in marchDocument 2: the long marchA = {ides, of, march}B1 = {caesar, died, in, march}

B2 = {the, long, march}

Spot checkQuery: registration deadlineDocument 1: new conference submission deadlineDocument 2: Online registration openA = {}B1 = {}B2 = {}Calculate the Jaccard coefficients for the query and each of the documentsDiscuss in Blackboard as much as needed. Post your response in the Spot Check 2

Issues with Jaccard for scoringIt doesnt consider term frequency (how many times a term occurs in a document)Rare terms in a collection are more informative than frequent terms. Jaccard doesnt consider this informationWe need a more sophisticated way of normalizing for lengthThe problem with the first example was that document 2 won because it was shorter, not because it was a better match. We need a way to take into account document length so that longer documents are not penalized in calculating the match score.


Top Related