Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web Rayid Ghani Accenture Technology Labs, USA Rosie Jones Carnegie Mellon University, USA Dunja Mladenic J. Stefan Institute, Slovenia

Post on 15-Jan-2016



Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web

Rayid Ghani, Accenture Technology Labs, USA

Rosie Jones, Carnegie Mellon University, USA

Dunja Mladenic, J. Stefan Institute, Slovenia


Motivation

Need a collection of documents matching a particular concept

Search on the web, modify query, analyze documents, modify query,…

Repetitive, time-consuming, requires reasonable familiarity with the concept


Task

Given:
1 document in the target concept
1 other document (negative example)
Access to a Web search engine

Create a Corpus of the Target Concept quickly with no human effort


Algorithm

[Diagram components: Seed Docs, Query Generator, WWW, Filter/Classifier]
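The loop between these components can be sketched roughly as follows. This is only an illustration under stated assumptions: `search` and `is_relevant` are hypothetical stand-ins for the Web search engine and the filter/classifier, the "web" is a toy list of strings, and terms are chosen by plain frequency among words not seen on the other side, which is a simplification rather than one of the methods from the slides.

```python
from collections import Counter

def build_corpus(seed_relevant, seed_nonrelevant, search, is_relevant, rounds=3, k=2):
    """Repeatedly build a query, fetch documents, and filter them."""
    relevant, nonrelevant = [seed_relevant], [seed_nonrelevant]
    seen = set()
    for _ in range(rounds):
        # Word statistics over the current relevant and non-relevant documents.
        rel_counts = Counter(w for d in relevant for w in d.split())
        non_counts = Counter(w for d in nonrelevant for w in d.split())
        # Simplified term choice (assumption): frequent words unique to each side.
        include = [w for w, _ in rel_counts.most_common() if w not in non_counts][:k]
        exclude = [w for w, _ in non_counts.most_common() if w not in rel_counts][:k]
        for doc in search(include, exclude):
            if doc in seen:
                continue
            seen.add(doc)
            (relevant if is_relevant(doc) else nonrelevant).append(doc)
    return relevant[1:]  # collected documents, without the seed

# Toy stand-ins (hypothetical): a four-document "web" and a trivial filter.
toy_web = ["dober dan slovenija", "dober vecer", "good day", "dan je lep"]

def toy_search(include, exclude):
    return [d for d in toy_web
            if all(t in d.split() for t in include)
            and not any(t in d.split() for t in exclude)]

def toy_filter(doc):
    return "good" not in doc.split()

corpus = build_corpus("dober dan", "good morning", toy_search, toy_filter)
```

In a real setting the filter would be a language or concept classifier and the search call would go to an actual search engine.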


[Diagram components: Initial Docs, Word Statistics, Build Query, Web, Filter, Relevant, Non-Relevant, Learning]


Query Generation

Examine the current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to the non-relevant ones

A query consists of m inclusion terms and n exclusion terms, e.g., +intelligence +web -military
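As a minimal sketch of this query format, assuming an ASCII `+` prefix for inclusion terms and an ASCII `-` prefix for exclusion terms (the function name `build_query` is hypothetical):

```python
def build_query(include_terms, exclude_terms):
    """Prefix inclusion terms with '+' and exclusion terms with '-'."""
    parts = [f"+{t}" for t in include_terms] + [f"-{t}" for t in exclude_terms]
    return " ".join(parts)

print(build_query(["intelligence", "web"], ["military"]))
# prints: +intelligence +web -military
```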


Query Term Selection Methods

Uniform (UN) – select k words randomly from the current vocabulary

Term-Frequency (TF) – select top k words ranked according to their frequency

Probabilistic TF (PTF) – select k words with probability proportional to their frequency
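The three frequency-based methods might look like the sketch below. The slides do not say whether words are drawn with or without replacement, so drawing without replacement is an assumption here, as are the function names and the toy word counts.

```python
import random
from collections import Counter

def select_uniform(counts, k, rng=random):
    """UN: pick k distinct words uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))

def select_tf(counts, k):
    """TF: pick the k most frequent words (ties broken alphabetically)."""
    return sorted(counts, key=lambda w: (-counts[w], w))[:k]

def select_ptf(counts, k, rng=random):
    """PTF: draw k distinct words, each draw proportional to frequency."""
    pool = dict(counts)
    chosen = []
    while pool and len(chosen) < k:
        words = list(pool)
        word = rng.choices(words, weights=[pool[w] for w in words], k=1)[0]
        chosen.append(word)
        del pool[word]  # assumption: sample without replacement
    return chosen

# Toy vocabulary counts (made up for illustration).
counts = Counter("dober dan dober vecer dober dan".split())
print(select_tf(counts, 2))  # prints: ['dober', 'dan']
```

The probabilistic variant trades some precision for diversity: frequent words are still favored, but rarer words occasionally enter the query.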


Query Term Selection Methods

RTFIDF – top k words according to their rtfidf scores

Odds-Ratio (OR) – top k words according to their odds-ratio scores

Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
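The slides do not give the odds-ratio formula, so the sketch below assumes a common form: the log of the odds that a word appears in a relevant document divided by the odds that it appears in a non-relevant one, estimated from document frequencies with a small smoothing constant `eps`. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def odds_ratio_scores(relevant_docs, nonrelevant_docs, eps=0.5):
    """Log odds ratio per word; documents are lists of tokens."""
    def doc_freq(docs):
        df = Counter()
        for doc in docs:
            df.update(set(doc))  # count each word once per document
        return df

    df_rel, df_non = doc_freq(relevant_docs), doc_freq(nonrelevant_docs)
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for w in set(df_rel) | set(df_non):
        # Smoothed estimates keep both probabilities strictly inside (0, 1).
        p_rel = (df_rel[w] + eps) / (n_rel + 2 * eps)
        p_non = (df_non[w] + eps) / (n_non + 2 * eps)
        scores[w] = math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))
    return scores

def top_k(scores, k):
    """OR: the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

relevant = [["slovenija", "je", "web"], ["slovenija", "web"]]
nonrelevant = [["the", "web"], ["the", "is"]]
scores = odds_ratio_scores(relevant, nonrelevant)
```

Words that are common in relevant documents and rare in non-relevant ones score highest, which suits inclusion terms; strongly negative scores point to candidate exclusion terms.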


Query Parameters

4 parameters:
Inclusion term-selection method
Exclusion term-selection method
Inclusion length
Exclusion length

Example: Odds-Ratio, rtfidf, 3, 6
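One way to hold these four parameters is a small record type. This is a sketch only: the class name, field names, and parsing helper are hypothetical, and it assumes the example lists the parameters in the same order as the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryParams:
    inclusion_method: str
    exclusion_method: str
    inclusion_length: int
    exclusion_length: int

def parse_params(text):
    """Parse 'method, method, m, n' into a QueryParams record."""
    inc, exc, m, n = [field.strip() for field in text.split(",")]
    return QueryParams(inc, exc, int(m), int(n))

params = parse_params("Odds-Ratio, rtfidf, 3,6")
```

Read this way, the example selects 3 inclusion terms by odds ratio and 6 exclusion terms by rtfidf.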


Experimental Setup

Language: Slovenian
Initial documents: 1 web page in Slovenian, 1 in English
Search engine: AltaVista


Evaluation

Goal: Collect as many relevant documents as possible while minimizing the cost

Cost:
Total number of documents retrieved from the Web
Number of distinct queries issued to the search engine

Evaluation measures:
Percentage of retrieved documents that are relevant
Number of relevant documents retrieved per unique query
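The two measures reduce to simple ratios. In the sketch below the function names are hypothetical and the counts are made up purely to show the arithmetic; they are not results from the experiments.

```python
def percent_relevant(num_relevant, num_retrieved):
    """Percentage of retrieved documents that are relevant."""
    return 100.0 * num_relevant / num_retrieved if num_retrieved else 0.0

def relevant_per_query(num_relevant, num_unique_queries):
    """Relevant documents retrieved per unique query issued."""
    return num_relevant / num_unique_queries if num_unique_queries else 0.0

# Made-up counts: 40 relevant out of 200 retrieved, using 10 unique queries.
print(percent_relevant(40, 200))   # prints: 20.0
print(relevant_per_query(40, 10))  # prints: 4.0
```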


Fixed Query Parameters

Fix Query Lengths and Vary Term-Selection Methods

Fix Term-Selection Methods and Vary Query Lengths

Results (Ghani et al., SIGIR 2001):
Odds-Ratio works well overall
Long queries are precise but have low recall


Why Online Learning?

Different Term-Selection Methods Excel with different Query Lengths

Best Combination of methods and lengths may change as different parts of the Web/feature space are explored


Learning Methods

Memory-Less (ML) Learning: ignore all history and use only the current performance

Long-Term Memory (LT) Learning: use all of the previous history
Additive update rule
Multiplicative update rule

Fading Memory (FM) Learning: use all of the history, but with a decay function over time
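The slides name these update styles without giving formulas, so the following is only one plausible reading. It assumes each parameter setting carries a weight, that `reward` is the observed performance of the latest query (for example, the fraction of retrieved documents that were relevant), and that `eta` and `decay` are hypothetical tuning constants.

```python
def memoryless(_weight, reward):
    """ML: the new weight depends only on the current performance."""
    return reward

def long_term_additive(weight, reward):
    """LT, additive: accumulate every reward seen so far."""
    return weight + reward

def long_term_multiplicative(weight, reward, eta=0.5):
    """LT, multiplicative: scale the weight up by each reward."""
    return weight * (1.0 + eta * reward)

def fading_memory(weight, reward, decay=0.5):
    """FM: older rewards count for less at every step."""
    return decay * weight + reward

def run(update, rewards, start=1.0):
    """Apply an update rule over a sequence of rewards."""
    weight = start
    for reward in rewards:
        weight = update(weight, reward)
    return weight

rewards = [1.0, 0.0, 1.0]  # made-up performance history
print(run(fading_memory, rewards))  # prints: 1.375
```

A query generator could then pick the next parameter setting in proportion to these weights, so that settings which recently worked well are tried more often.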


Results

[Figure: plots comparing Long-Term Memory (LTM) and Memory-Less learning]


Results


Further Experiments

Other languages: similar results with Croatian, Czech and Tagalog

Keywords: similar results when initializing with keywords instead of documents

Comparison to AltaVista's "More Like This": better performance than AltaVista's feature


Conclusions

Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines

Online Learning is useful in adapting to different parts of the Web space

System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder