Web Query Analysis: Aligning Queries to Periodic Events




Student: Trinh Hoang Minh
Supervisor: Dr. Min-Yen Kan
Project ID: H079370


We describe a method to determine the temporal correlation between web queries. In particular, we:
• Study how to identify periodic queries-as-events.
• Correlate other, non-periodic queries to these events.
• Develop a prototype to analyze such temporal correlation between queries and assess its performance, resulting in over 90% accuracy.

To identify whether other (non-periodic) queries are correlated to these periodic events-as-queries, we again used supervised classification, using four main kinds of features.

• Overall Correlation: the correlation coefficient between the two queries' histograms, calculated over the full period.

• Most Recent Year Correlation: the correlation coefficient for the last 12 months (i.e. 2007), calculated and treated as a separate feature.

• Conjunctive Data: two features measure the strength of the conjoined queries:
  • the number of web search results returned by a search engine (Google’s Search API, in our case).
  • the number of times the two queries appear together in the top ten titles.
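The two histogram-based features above can be sketched as follows. This is only an illustration, not the poster's implementation: it assumes the "correlation coefficient" is the Pearson coefficient, that histograms are equal-length lists of relative volumes, and that there are 12 points per year; the names `pearson`, `correlation_features` and `points_per_year` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # Assumption: a flat histogram is treated as uncorrelated.
        return 0.0
    return cov / sqrt(vx * vy)


def correlation_features(hist_a, hist_b, points_per_year=12):
    """Return (overall correlation, most-recent-year correlation)."""
    overall = pearson(hist_a, hist_b)
    recent = pearson(hist_a[-points_per_year:], hist_b[-points_per_year:])
    return overall, recent


# Made-up monthly volumes: three years with a peak late in each year.
year = [1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 9, 6]
event_query = year * 3
related_query = [2 * v + 1 for v in event_query]
print(correlation_features(event_query, related_query))
```

Keeping the most recent year as its own feature lets a classifier weigh current behaviour separately from the long-run trend.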

Two classifiers have been trained: a periodic classifier and a correlation classifier.

Periodic Classifier Evaluation
• We obtained judgments from 7 staff members for all 68 queries, resulting in a Fleiss’ Kappa score of K = 0.802.
• The classifier achieved a high true positive rate of 93.1%, with K = 0.794 when compared with the human judges.
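For readers unfamiliar with the agreement measure, here is a minimal sketch of the standard Fleiss' Kappa computation. The input layout (one row per query, one count per category) and the toy ratings are assumptions for illustration; the poster's actual judgment data is not reproduced here.

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for ratings[i][j] = number of raters who put
    item i into category j. Every item must have the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(ratings[0])

    # Share of all assignments that fall into each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Agreement among raters on each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        # Assumption: all ratings in one category counts as full agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: 3 queries, 7 judges, categories (periodic, non-periodic).
toy = [[7, 0], [0, 7], [4, 3]]
print(round(fleiss_kappa(toy), 3))
```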

Correlation Classifier Evaluation
• We divided the queries into periodic and non-periodic categories based on the previous classification.
• We applied the second (correlation) classifier to pairs of queries, one drawn from each class.
• Each pair was manually classified as to whether the two queries were thought to be correlated.
• The classifier achieved a high true positive rate of 93.3%, with a Fleiss’ Kappa score of 0.70 when compared with the human judges.

• We collected 29 popular queries that represent annual events and 39 queries related to those events.

• Using the popular trend search from Google Trends, we obtain the query volume histogram images. Numerical data are then extracted from downloaded images using a graph digitizer.

(a)                           Periodic      Non-periodic
Classify as Periodic          27 (93.1%)    1 (2.6%)
Classify as Non-periodic      2 (6.9%)      38 (97.4%)

(b)                           Correlated    Not Correlated
Classify as Correlated        56 (93.3%)    35 (2.6%)
Classify as Not Correlated    4 (6.7%)      1309 (97.4%)
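The percentages in tables (a) and (b) are column-wise rates, which can be rechecked from the raw counts. The sketch below assumes columns are the human labels and rows are the classifier output; the overall accuracy it prints is derived here from the counts and is not a figure stated on the poster.

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, true negative rate, accuracy) from counts."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + fn + tn)
    return tpr, tnr, acc


# Counts copied from tables (a) and (b).
periodic = rates(tp=27, fp=1, fn=2, tn=38)
correlation = rates(tp=56, fp=35, fn=4, tn=1309)

for name, (tpr, tnr, acc) in [("periodic", periodic), ("correlation", correlation)]:
    print(f"{name}: TPR={tpr:.1%} TNR={tnr:.1%} accuracy={acc:.1%}")
```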

[Figure: flowchart of the query suggestion prototype. Recoverable labels: Input Query; Is new query; Is failed query; Query Cache Table; Fault Query Cache Table; Connect to Google Trends; Google Trends; Graph Digitizer; Query Analysis (Query Histogram Analysis, Query Periodic Classification, Query Temporal Correlation Classification); Periodic Table; Get other queries; Is Exist; Update Correlation Database; Correlation Database; Numerical Reports; Correlation Report; User’s Report; Not enough Information.]

[Figure labels: Discarded Area (twice).]

An original (blue) vs. lagged (red) histogram

• Key Period Correlation: A key period is defined as a period with high search volume relative to other periods. We then compute the correlation coefficient over these key periods only.
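A rough sketch of the key period feature follows. The poster does not say how "high search volume" is thresholded, so the rule used here (volume at least `ratio` times the histogram mean, taken from the periodic query) is purely an assumption, as are the function names and the use of the Pearson coefficient.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: flat data counts as uncorrelated
    return cov / sqrt(vx * vy)


def key_period_indices(hist, ratio=1.5):
    """Indices whose volume is at least `ratio` times the mean (hypothetical rule)."""
    mean = sum(hist) / len(hist)
    return [i for i, v in enumerate(hist) if v >= ratio * mean]


def key_period_correlation(periodic_hist, other_hist, ratio=1.5):
    """Correlate the two histograms over the periodic query's key periods only."""
    idx = key_period_indices(periodic_hist, ratio)
    if len(idx) < 2:
        return 0.0  # too few points to correlate
    return pearson([periodic_hist[i] for i in idx], [other_hist[i] for i in idx])


# Made-up monthly volumes over two years.
periodic = [1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 9, 6] * 2
other = [3, 2, 3, 2, 3, 2, 3, 2, 3, 4, 8, 5] * 2
print(key_period_indices(periodic))
print(key_period_correlation(periodic, other))
```

Restricting the comparison to key periods ignores off-season noise, which in this toy data is the only place the two queries disagree.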

A recurring event has regular, repeated peaks in its histogram, corresponding to the event’s actual date. We train a supervised Bayesian Network Classifier, using two main features:

• Autocorrelation Function (ACF) with the lag value k set to one year.

• Correlation Coefficient Value (CCV) of pair-wise yearly histograms (2005, 2006, 2007). To reduce noise and variability, dynamic time warping (DTW) was applied to find the best match among yearly histograms.
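The two building blocks named above can be sketched as follows. This is an illustration under assumptions: it uses the standard sample autocorrelation estimator and the textbook DTW recurrence with absolute difference as the local cost. The poster does not specify which DTW variant was used or exactly how it was combined with the CCV feature.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    if denom == 0:
        return 0.0  # assumption: a flat series has no periodic signal
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


# Made-up monthly volumes: the same yearly shape repeated three times.
year = [1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 9, 6]
series = year * 3
print(autocorrelation(series, 12), autocorrelation(series, 6))

# A peak that arrives one step later still aligns at zero cost under DTW.
print(dtw_distance([1, 5, 9, 6], [1, 1, 5, 9, 6]))
```

A periodic query should show a clearly higher autocorrelation at the one-year lag than at other lags, and DTW tolerates small shifts in when the yearly peak occurs.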

Contribution:
• Periodic query classification.
• Temporal correlation for web queries: correlate queries to periodic events with reasonable accuracy, using only relative volume histograms and search results.
• Facilitate proactive query suggestion or re-ranking of search results, which we are planning to explore as applications.

Future work:
• Extend our work by integrating more data on query trends from news and blog trends.
• Extend our work to use partial correlation to correct for overall query volume growth.

An example of query yearly histograms