beyond boolean queries

Slide 1

Beyond Boolean QueriesRanked retrievalThus far, our queries have all been Boolean.Documents either match or dont.Good for expert users with precise understanding of their needs and the collection.Also good for applications: Applications can easily consume 1000s of results.Writing the precise query that will produce the desired result is difficult for most users.

Problem with Boolean search:feast or famineBoolean queries often result in either too few (=0) or too many (1000s) results.A query that is too broad yields hundreds of thousands of hitsA query that is too narrow may yield no hitsIt takes a lot of skill to come up with a query that produces a manageable number of hits.AND gives too few; OR gives too manyCf. our discussion of how Westlaw Boolean queries didnt actually outperform free text querying

3Ranked retrieval modelsRather than a set of documents satisfying a query expression, in ranked retrieval models, the system returns an ordering over the (top) documents in the collection with respect to a queryFree text queries: Rather than a query language of operators and expressions, the users query is just one or more words in a human languageIn principle, these are different options, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa

4Feast or famine: not a problem in ranked retrievalWhen a system produces a ranked result set, large result sets are not an issueIndeed, the size of the result set is not an issueWe just show the top k ( 10) resultsWe dont overwhelm the user

Premise: the ranking algorithm worksScoring as the basis of ranked retrievalWe wish to return, in order, the documents most likely to be useful to the searcherHow can we rank-order the documents in the collection with respect to a query?Assign a score say in [0, 1] to each documentThis score measures how well document and query match.Query-document matching scoresWe need a way of assigning a score to a query/document pairLets start with a one-term queryIf the query term does not occur in the document: score should be 0The more frequent the query term in the document, the higher the score (should be)We will look at a number of alternatives for this.Take 1: Jaccard coefficientA commonly used measure of overlap of two sets A and Bjaccard(A,B) = |A B| / |A B|jaccard(A,A) = 1jaccard(A,B) = 0 if A B = 0A and B dont have to be the same size.Always assigns a number between 0 and 1.Jaccard coefficient: Scoring exampleWhat is the query-document match score that the Jaccard coefficient computes for each of the two documents below?Query: ides of marchDocument 1: caesar died in marchDocument 2: the long marchDoc 1: 1/6 (Using all words, no stop words ignored)Doc 2: 1/59Jaccard Example doneQuery: ides of marchDocument 1: caesar died in marchDocument 2: the long marchA = {ides, of, march}B1 = {caesar, died, in, march}

B2 = {the, long, march}

Spot checkQuery: registration deadlineDocument 1: new conference submission deadlineDocument 2: Online registration openA = {}B1 = {}B2 = {}Calculate the Jaccard coefficients for the query and each of the documentsDiscuss in Blackboard as much as needed. Post your response in the Spot Check 2

Issues with Jaccard for scoringIt doesnt consider term frequency (how many times a term occurs in a document)Rare terms in a collection are more informative than frequent terms. Jaccard doesnt consider this informationWe need a more sophisticated way of normalizing for lengthThe problem with the first example was that document 2 won because it was shorter, not because it was a better match. We need a way to take into account document length so that longer documents are not penalized in calculating the match score.

beyond boolean queries

Documents

query match

users query

query term

precise query

hitsa query

query expression

querydocument match

query language of operators