A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies Research Center IIIT-Hyderabad 500 032 ICON 2010

A Simple Unsupervised Query Categorizer for Web Search Engines

Prashant Ullegaddi and Vasudeva Varma
Search and Information Extraction Lab
Language Technologies Research Center
IIIT-Hyderabad 500 032

ICON 2010

Outline

• Query categorization
• Related work
• Importance of ranking
• Challenges
• Design goals
• Our approach
• Results
• Conclusion

Query categorization (QC)

• Automatic categorization (classification) of user queries into one or more pre-defined categories

• Note that categories are pre-defined and may vary across different applications

• However, for a particular application categories remain the same over a reasonable amount of time

Contributions

• Solving query categorization as a purely information retrieval problem

• Emphasis on importance of ranking of categories for QC systems

• Being simple and unsupervised, our system can establish a new baseline

Related Work

• Text categorization techniques (Shen et al., 2005, 2006):
– Solve QC as a text categorization problem
– But queries are not as rich as text documents in terms of context
– Text classifiers are trained with a static vocabulary, which may not account for the dynamic nature of the Web.

…Related Work

• Graph based models (Diemert and Vandelle, 2009):
– Construct concept graphs from search query logs
– Once the concept graph is constructed, a query is categorized by traversing the graph.
– Not all search engines have the luxury of large search query logs.

Research Questions

• Can we solve QC by considering it purely as an IR problem?

• Can we combine the existing relatively standard IR techniques to solve QC?

• Can an already categorized corpus be used to perform query categorization?

• Can we establish a new baseline for QC systems?

Importance of ranking

• Consider category listings of two hypothetical systems for the query “Ipod”

Rank   System (I) category listing    System (II) category listing
1      Entertainment/Celebrities      Entertainment/Music
2      Computers/Hardware             Computers/Hardware
3      Computers/Software             Computers/Software
4      Info/References & Libraries    Info/References & Libraries
5      Entertainment/Music            Entertainment/Celebrities

• Both systems return the same five categories; only the order differs. System (II) ranks Entertainment/Music first, while System (I) ranks it last, so ranking plays an important role for QC systems

Challenges

• Category representation:
– Categories need to be defined (covering most of the Web)
– Each category needs to be represented by a set of documents that best describe that category.

Category representation is needed in order to solve QC purely as an IR problem.

…Challenges

• Query expansion/enrichment: Usually queries are very short.
• Average query length in KDD Cup 2005 was 3.12 words.
• 22.5% of the queries were 3 words long.
• 78.7% of the queries had at most 4 words.

Category Representation

• Categories of Open Directory Project (ODP) for QC
• Web documents that are classified under a category represent that category.
• Approximately 2.4 million English documents (of ODP) to represent categories
• These documents are classified into approximately 380K categories.
• Here the assumption is that these categories cover the entire Web.
• This corpus of ODP documents is used to perform QC.

Design Goals

• Our design goals:
– Simple
– Unsupervised framework
– Implementable on Web scale
– To solve QC as a “search” problem, since “search” is a task a Web search engine can afford for free.

Our Approach

[Figure: pipeline diagram with the stages Expanded Query, ODP documents, ODP Categories and Target Categories]

Query Expansion

• Pseudo relevance feedback query expansion
• Submit the query to a Web search engine
• Collect stemmed terms (Q’) from the titles and snippets of the top N search results
• Remove stop words
• Weight each term by its document frequency (DF)

…Query Expansion

• Common concepts for a query usually occur in most of the top web documents obtained for the query

• This information is best captured by DF

• These common concepts represent the query

[Figure: the query “Serena Williams” is submitted to a Web search engine; the terms Tennis, Sports, WTA and Wimbledon recur across the returned results and are extracted as the concepts representing the query]
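
The DF-based expansion described on these slides can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the search results are passed in as plain strings (title plus snippet) rather than fetched from a search engine, and the stop list, the `naive_stem` helper and the `top_terms` cutoff are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption; the slides do not name one).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "at", "to", "for", "on"}


def naive_stem(token):
    """Crude plural stripper standing in for a real stemmer (assumption)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "is", "us")):
        return token[:-1]
    return token


def expand_query(results, top_terms=5):
    """Return (term, DF weight) pairs from the title+snippet text of the top N results."""
    df = Counter()
    for text in results:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # A set, so each result contributes at most 1 to a term's DF.
        terms = {naive_stem(t) for t in tokens if t not in STOP_WORDS}
        df.update(terms)
    n = len(results)
    # Highest DF first; ties broken alphabetically for a stable order.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, count / n) for term, count in ranked[:top_terms]]


# Hypothetical titles/snippets for the query "Serena Williams".
results = [
    "Serena Williams wins tennis title",
    "WTA tennis rankings: Serena Williams",
    "Serena Williams at Wimbledon, tennis and sports news",
    "Sports: WTA tour schedule",
]
expanded = expand_query(results)
```

Terms that recur across many results (here "tennis", "wta", "sport") receive the highest weights, which is the behaviour the DF measure is meant to capture.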

Central Idea

• The ODP documents that match the query-related concepts are good enough to carry out QC

• In essence, these are documents topically similar to the query

• This fact is leveraged in our unsupervised approach to QC

Query Categorization

• Search the expanded query on the ODP Web document corpus

• Each ODP document retrieved for the query belongs to at least one ODP category; the categories of the retrieved documents become the categories of the query

• An optional taxonomy mapping in case target categories are different from that of ODP
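
As a rough sketch of this retrieval step, the example below searches a weighted expanded query against a tiny categorized corpus and ranks categories by the summed scores of their retrieved documents. The corpus, the TF-IDF style scoring and the `top_docs` and `top_k` parameters are assumptions for illustration; the slides do not specify the retrieval model or how document scores are aggregated.

```python
import math
from collections import Counter, defaultdict

# Hypothetical miniature stand-in for the ODP corpus: (category, document text).
CORPUS = [
    ("Sports/Tennis", "tennis wta wimbledon grand slam player rankings"),
    ("Sports/Tennis", "tennis tournament schedule wta tour"),
    ("Sports/Golf", "golf pga tour player rankings"),
    ("Computers/Hardware", "laptop cpu memory hardware reviews"),
]


def categorize(expanded_query, corpus, top_docs=3, top_k=2):
    """Rank categories by the summed retrieval scores of their top-ranked documents."""
    docs = [(cat, Counter(text.split())) for cat, text in corpus]
    n = len(docs)
    df = Counter(term for _, tf in docs for term in tf)
    scored = []
    for cat, tf in docs:
        # Weighted TF-IDF overlap; this scoring function is an assumption.
        score = sum(
            weight * tf[term] * math.log(1 + n / df[term])
            for term, weight in expanded_query.items()
            if term in tf
        )
        if score > 0:
            scored.append((score, cat))
    scored.sort(key=lambda pair: -pair[0])
    # Each retrieved document votes for its category with its score.
    votes = defaultdict(float)
    for score, cat in scored[:top_docs]:
        votes[cat] += score
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cat for cat, _ in ranked[:top_k]]


# Hypothetical expanded query with DF weights.
query = {"tennis": 0.75, "wta": 0.5, "tour": 0.25}
categories = categorize(query, CORPUS)
```

No classifier is trained at any point: the category labels come entirely from the documents the search returns.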

Taxonomy Mapping for KDD Cup dataset

• We map ODP categories to KDD Cup categories to evaluate on the KDD Cup dataset

• Note that these mappings are computed once, offline

…Taxonomy Mapping

• Search for each target category in the ODP category descriptions

• For a target category t, let the set of retrieved ODP categories be C

• Map every category in C to target category t.

• Repeat this for the other target categories to obtain the full set of mappings
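
The mapping procedure above can be sketched as follows. This is an illustrative assumption-laden example: the ODP descriptions and target categories are invented, and a simple "all words of the target name occur in the description" match stands in for the search step, whose actual form the slides do not give.

```python
from collections import defaultdict

# Hypothetical ODP category descriptions (real ones would come from ODP).
ODP_DESCRIPTIONS = {
    "Top/Sports/Tennis": "sports tennis players tournaments",
    "Top/Arts/Music/Bands": "entertainment music bands artists",
    "Top/Computers/Hardware": "computers hardware components",
}

# Hypothetical target taxonomy.
TARGET_CATEGORIES = ["Sports/Tennis", "Entertainment/Music", "Computers/Hardware"]


def build_mapping(target_categories, odp_descriptions):
    """Map each ODP category to the target categories whose words all occur in its description."""
    mapping = defaultdict(set)
    for target in target_categories:
        words = set(target.lower().replace("/", " ").split())
        for odp_cat, description in odp_descriptions.items():
            # Boolean AND match, a simple stand-in for the search step.
            if words <= set(description.split()):
                mapping[odp_cat].add(target)
    return dict(mapping)


mapping = build_mapping(TARGET_CATEGORIES, ODP_DESCRIPTIONS)
```

Because the mapping depends only on the two taxonomies, it can be built once and reused for every query.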

…Taxonomy Mapping

• Let C(Q) be the set of ODP categories returned for a query Q

• Target categories are ranked by how many categories of C(Q) map to them: the more ODP categories map to a target category, the higher it is ranked

• Top K categories in target space are returned as top K target categories for the query
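
A minimal sketch of this vote-based ranking is shown below. The `MAPPING` table and the example C(Q) are hypothetical, and each ODP category is assumed to map to a single target category for simplicity; ties are broken alphabetically, which is also an assumption.

```python
from collections import Counter

# Hypothetical ODP-to-target mapping, assumed to have been computed offline.
MAPPING = {
    "Top/Arts/Music/Bands": "Entertainment/Music",
    "Top/Arts/Music/Styles": "Entertainment/Music",
    "Top/Shopping/Music": "Entertainment/Music",
    "Top/Computers/Hardware": "Computers/Hardware",
    "Top/Computers/Hardware/Storage": "Computers/Hardware",
    "Top/Computers/Software": "Computers/Software",
}


def rank_targets(odp_categories, mapping, k):
    """Return the top k target categories, ranked by how many ODP categories map to them."""
    votes = Counter(mapping[c] for c in odp_categories if c in mapping)
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [target for target, _ in ranked[:k]]


# Hypothetical C(Q): ODP categories returned for some query Q.
c_q = list(MAPPING) + ["Top/Regional/Unmapped"]
top = rank_targets(c_q, MAPPING, k=2)
```

ODP categories without a mapping simply cast no vote.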

Dataset

• KDD Cup 2005 dataset (Li et al., 2005)
• A set of 800K unlabeled queries sampled from MSN search query logs
• 67 predefined categories
• A set of 800 queries (sampled from the 800K queries) was labeled
• Three labelers independently labeled this set
• Each query was tagged with at most 5 categories
• This dataset serves as the standard dataset for QC evaluation

Evaluation Metrics

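Precision, Recall and F1 are defined, respectively, as follows:

\[ \mathrm{Precision} = \frac{\sum_i \#\{\text{queries correctly tagged as } C_i\}}{\sum_i \#\{\text{queries tagged as } C_i\}} \]

\[ \mathrm{Recall} = \frac{\sum_i \#\{\text{queries correctly tagged as } C_i\}}{\sum_i \#\{\text{queries tagged as } C_i \text{ by human labelers}\}} \]

\[ F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]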
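
These definitions pool counts over all categories, which can be sketched as below. The function name `evaluate` and the toy predictions and labels are assumptions for illustration; summing the per-query overlaps gives the same total as summing correct tags per category.

```python
def evaluate(predicted, gold):
    """Compute precision, recall and F1 from dicts mapping query -> set of categories."""
    # Correct tags: categories assigned by the system that the labeler also assigned.
    correct = sum(len(predicted.get(q, set()) & cats) for q, cats in gold.items())
    tagged = sum(len(cats) for cats in predicted.values())
    labeled = sum(len(cats) for cats in gold.values())
    precision = correct / tagged if tagged else 0.0
    recall = correct / labeled if labeled else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical human labels and system output.
gold = {
    "ipod": {"Entertainment/Music", "Computers/Hardware"},
    "serena williams": {"Sports/Tennis"},
}
predicted = {
    "ipod": {"Entertainment/Music", "Entertainment/Celebrities", "Computers/Software"},
    "serena williams": {"Sports/Tennis"},
}
precision, recall, f1 = evaluate(predicted, gold)
```

In this toy case 2 of the 4 system tags are correct and 2 of the 3 human tags are recovered.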

Results

Approach                               Precision   F1      Prec@k
State of the art (Shen et al., 2005)   0.414       0.444   0.599
Best today (Shen et al., 2006b)        0.465       0.461   NA
KBS (Diemert and Vandelle, 2009)       0.614*      0.460   NA
Our System                             0.428       0.415   0.624 (+4.2%)

*High precision reported by the KBS system is due to binary categorization

On Results

• Though F1 reported for our system is marginally lower, we believe our system should be viewed from a different perspective

• It solves QC purely as an information retrieval problem

• It combines relatively standard techniques to solve QC, making it
– simple, and
– implementable on a very large scale

…On Results

• Our system is unsupervised in nature

• Our system does not make use of resources like search query logs

• Thus, we believe the results reported complement our design goals to a reasonable extent

Conclusion

• A simple, unsupervised yet effective approach to query categorization

• Leverages already categorized corpus (ODP) to perform QC

• Advantages
– Simple approach
– Unsupervised
– Existing IR techniques can be used
– Avoids multiclass classification