![Page 1: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/1.jpg)
Implementing Query Classification
HYP: End of Semester Update, prepared by Minh
![Page 2: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/2.jpg)
Previously…
Web search queries:
◦ Understand the user's goal
Broder (2002):
◦ Queries are classified into 3 categories: Informational, Navigational, Transactional
![Page 3: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/3.jpg)
Previously…
Functional Faceted Web Query Classification
Ambiguity: Polysemous, General, Specific
Authority Sensitivity: Yes - No
Spatial Sensitivity: Yes - No
Temporal Sensitivity: Yes - No
◦ Query's 4-tuple: <Am, Au, S, T>
◦ 3 * 2 * 2 * 2 = 24 different combinations.
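The combination count can be checked with a small sketch. The dictionary keys and the function name `all_facet_tuples` are illustrative names, not part of the original system; only the facet values come from the slide.

```python
from itertools import product

# Facet values as listed on the slide; the key names are illustrative.
FACETS = {
    "ambiguity": ("Polysemous", "General", "Specific"),
    "authority_sensitivity": ("Yes", "No"),
    "spatial_sensitivity": ("Yes", "No"),
    "temporal_sensitivity": ("Yes", "No"),
}


def all_facet_tuples():
    """Enumerate every possible <Am, Au, S, T> tuple."""
    return list(product(*FACETS.values()))


combos = all_facet_tuples()
print(len(combos))  # 3 * 2 * 2 * 2 combinations
```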
![Page 4: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/4.jpg)
Temporal Sensitivity
Definition:
◦ A keyword is temporally sensitive if the results returned by querying it on a web search engine tend to change over time.
◦ Example:
Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc.
Non-temporally sensitive: video, buying car, etc.
![Page 5: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/5.jpg)
Up-to-date Project Scope
Objective: to analyze the temporal sensitivity facet of web search queries.
Problem: find the temporal correlation between web queries.
![Page 6: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/6.jpg)
Web Query Histogram
Periodic queries: Champions League Final
Non-periodic queries: Liverpool
![Page 7: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/7.jpg)
Queries Correlation
Observation: the two keywords are temporally related to each other
![Page 8: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/8.jpg)
Proposed System Framework
1. Ask Google Trends for the query's histogram
2. Use a histogram digitizer program (Plotparser by WeiHua) to get the numerical data
3. Query correlation:
   • Calculate the correlation coefficient between queries
4. Query classification
![Page 9: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/9.jpg)
Google Trends
![Page 10: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/10.jpg)
Histogram Digitizer
![Page 11: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/11.jpg)
Queries Correlation: 1st attempt
Calculate the correlation coefficient:
◦ Using data of 45 months: January 2004 until September 2007
◦ Calculate the coefficient based on the entire histograms
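The slides do not say which correlation coefficient was used. A minimal sketch, assuming the Pearson coefficient over two equal-length monthly series, could look like this; the function name `pearson` and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Made-up monthly volumes for two keywords (not real Google Trends data).
keyword_a = [10, 12, 40, 11, 9, 10]
keyword_b = [5, 6, 22, 7, 5, 4]
print(pearson(keyword_a, keyword_b))
```

In the first attempt each series would cover all 45 months, so a single coefficient summarizes the whole histogram pair.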
![Page 12: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/12.jpg)
Result classification: 1st attempt
Data of 15 different popular keywords, of which:
◦ Periodic keywords: Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Christmas(!).
◦ Related keywords: PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami
All keywords are compared to each other based on the correlation coefficient of their histograms.
(15*14)/2 = 105 instances
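The instance count follows from taking every unordered pair of the 15 keywords. A short sketch using the keyword lists from the slide (the variable names are illustrative):

```python
from itertools import combinations

PERIODIC = ["Champions League Final", "Grammy", "Pro Evolution Soccer",
            "Oscar Winner", "Valentine", "Christmas"]
RELATED = ["PS2", "Xbox", "Jack Nicholson", "Beyonce", "chocolate",
           "chocolateNews", "Liverpool", "EA Sport", "Konami"]

keywords = PERIODIC + RELATED
# Each unordered pair of distinct keywords is one instance to classify.
pairs = list(combinations(keywords, 2))
print(len(keywords), len(pairs))
```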
![Page 13: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/13.jpg)
Result classification: 1st attempt
Classification based on threshold method:
◦ Statistical result:
Threshold value: 0.25

| Correlation Prediction | True Positive Rate | False Positive Rate |
|---|---|---|
| Yes | 88.89% | 10.34% |
| No | 89.66% | 11.11% |
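A threshold classifier of this kind, together with the true and false positive rates for the "Yes" class, can be sketched as follows. The coefficients and labels are toy values, and the function names are hypothetical; only the 0.25 threshold comes from the slide.

```python
def classify_by_threshold(coefs, threshold=0.25):
    """Predict 'correlated' when the coefficient reaches the threshold."""
    return [c >= threshold for c in coefs]


def rates(predicted, actual):
    """Return (true positive rate, false positive rate) for the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Toy data: six keyword pairs with hand-assigned ground truth.
coefs = [0.9, 0.3, 0.1, 0.4, -0.2, 0.05]
actual = [True, True, True, False, False, False]
predicted = classify_by_threshold(coefs)
tpr, fpr = rates(predicted, actual)
print(tpr, fpr)
```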
![Page 14: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/14.jpg)
1st attempt problems:
Very low threshold value
◦ Only one feature used.
Using the entire histogram, while some keywords are only temporally related to each other during some periods of time.
◦ Example: Valentine – Chocolate (correlation appears during February)
![Page 15: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/15.jpg)
Queries Correlation: 2nd attempt
Interesting period:
◦ Period in which two queries are highly related to each other
-> Segmentation (clustering) problem
![Page 16: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/16.jpg)
Clustering Using Simple K-means
Algorithm to predict the number of clusters
Use WEKA to cluster the histogram
![Page 17: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/17.jpg)
Query Correlation: 2nd attempt
Periodic keywords detection:
◦ Identify repeated patterns using correlation
◦ A periodic query tends to have a high correlation coefficient on the repeated parts.
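One possible reading of this idea, sketched under assumptions: split the monthly series into consecutive 12-month segments and average the Pearson coefficient between neighbouring segments. The 12-month period, the function names, and the toy series are assumptions, not details taken from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is flat."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def periodicity_score(series, period=12):
    """Average correlation between consecutive full segments of one period."""
    segments = [series[i:i + period]
                for i in range(0, len(series) - period + 1, period)]
    if len(segments) < 2:
        raise ValueError("need at least two full periods")
    coefs = [pearson(a, b) for a, b in zip(segments, segments[1:])]
    return sum(coefs) / len(coefs)


# Toy series: a yearly spike in the second month versus one isolated spike.
one_year = [0] * 12
one_year[1] = 10
periodic_series = one_year * 3
one_off_series = [0] * 36
one_off_series[5] = 10

print(periodicity_score(periodic_series), periodicity_score(one_off_series))
```

A score close to 1 suggests that the pattern repeats every period, while a score near 0 suggests that it does not.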
![Page 18: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/18.jpg)
Interesting Periods Projection
Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram
![Page 19: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/19.jpg)
Result Classification: 2nd attempt
Using the previous dataset
Related keywords are compared with each of the periodic keywords for correlation
Result:
◦ Managed to increase the threshold value to 0.5
![Page 20: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/20.jpg)
2nd attempt problems
K-means clustering does not guarantee correct detection of interesting periods:
◦ K-means requires the number of clusters to be provided
-> the algorithm implemented to determine the number of clusters failed to provide the correct value
Small training data set.
The threshold detector method is too simple.
![Page 21: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/21.jpg)
Queries Correlation: 3rd attempt
Need to find another way to identify interesting periods.
Peak period:
◦ Period in which there is a high peak in query volume
Peak detection problem:
◦ Mapping and smoothing using convolution
![Page 22: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/22.jpg)
Clustering using peak detection
Mapping:
![Page 23: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/23.jpg)
Clustering using peak detection
Smoothing using convolution:
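The slides do not specify the kernel. A minimal sketch of smoothing by discrete convolution, assuming a 3-point moving-average kernel and zero padding at the edges (both are assumptions), might be:

```python
def smooth(series, kernel):
    """Convolve the series with an odd-length kernel, keeping the same length.

    Values outside the series are treated as zero.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    out = []
    for i in range(len(series)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i - j + half  # convolution indexes the kernel in reverse
            if 0 <= idx < len(series):
                acc += weight * series[idx]
        out.append(acc)
    return out


moving_average = [1 / 3, 1 / 3, 1 / 3]  # assumed kernel
smoothed = smooth([0, 0, 3, 0, 0], moving_average)
print(smoothed)
```

The single spike is spread over its neighbours, which removes small jitter before any peak detection is applied.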
![Page 24: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/24.jpg)
Clustering using peak detection
Peak detection: use a simple slope-change algorithm to determine peaks and valleys
◦ (with threshold value: mean)
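A slope-change detector of this kind can be sketched as below. How the mean threshold is applied is an assumption here: peaks must lie above the series mean and valleys below it. The function names and the toy series are illustrative.

```python
def find_peaks(series):
    """Indices where the slope turns from rising to falling, above the mean."""
    threshold = sum(series) / len(series)
    return [i for i in range(1, len(series) - 1)
            if series[i] > series[i - 1]
            and series[i] >= series[i + 1]
            and series[i] > threshold]


def find_valleys(series):
    """Indices where the slope turns from falling to rising, below the mean."""
    threshold = sum(series) / len(series)
    return [i for i in range(1, len(series) - 1)
            if series[i] < series[i - 1]
            and series[i] <= series[i + 1]
            and series[i] < threshold]


volume = [1, 2, 9, 2, 1, 1, 8, 3, 1, 2, 1]  # toy query volumes
peaks = find_peaks(volume)
valleys = find_valleys(volume)
print(peaks, valleys)
```

The small bump at index 9 is ignored because it stays below the mean, which is the role the threshold plays in this sketch.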
![Page 25: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/25.jpg)
Interesting Periods Projections
Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram, and vice versa
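One way to read "projection" is to correlate the two histograms only inside the windows detected on one of them. The sketch below assumes that reading; the window format `(start, end)` with an exclusive end, the function names, and the toy data are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is flat."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def projected_coefs(periods, hist_a, hist_b):
    """Correlate two histograms inside each (start, end) window, end exclusive."""
    return [pearson(hist_a[s:e], hist_b[s:e]) for s, e in periods]


# Toy histograms: the keywords move together only in the first three months.
related = [1, 9, 1, 2, 3, 4]
periodic = [2, 8, 2, 4, 3, 2]
periods = [(0, 3)]  # interesting period detected on the related keyword

window_coefs = projected_coefs(periods, related, periodic)
full_coef = pearson(related, periodic)
print(window_coefs, full_coef)
```

Running the same function with the windows detected on the periodic keyword gives the "vice versa" direction.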
![Page 26: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/26.jpg)
Result Classification: 3rd attempt
Use larger training data:
◦ 47 popular keywords, of which:
15 periodic keywords and 32 related keywords
Each related keyword is compared with every periodic keyword to get a correlation coefficient (Coef).
◦ Data size: 15 * 32 = 480 instances
![Page 27: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/27.jpg)
Result Classification: 3rd attempt
Apply Naïve Bayes classifier (WEKA):
6 features:
◦ Average Coef from related keyword projection (AveRCoef)
◦ Average Coef from periodic keyword projection (AvePCoef)
◦ Overall Average Coef [= (AveRCoef+AvePCoef)/2]
◦ Max Coef from related keyword projection (MaxRCoef)
◦ Max Coef from periodic keyword projection (MaxPCoef)
◦ Average Max Coef [= (MaxRCoef+MaxPCoef)/2]
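Given the per-window coefficients from each projection direction, the six features can be derived as in this sketch. The feature formulas follow the slide; the function name, the dictionary keys for the two combined features, and the sample coefficients are illustrative.

```python
def six_features(related_coefs, periodic_coefs):
    """Build the six features from the two lists of projected coefficients."""
    ave_r = sum(related_coefs) / len(related_coefs)
    ave_p = sum(periodic_coefs) / len(periodic_coefs)
    max_r = max(related_coefs)
    max_p = max(periodic_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,
    }


# Made-up coefficients for one (related, periodic) keyword pair.
features = six_features([0.2, 0.8], [0.4, 0.6])
print(features)
```

Each keyword pair would yield one such feature vector as input to the classifier.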
![Page 28: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/28.jpg)
Result Classification: 3rd attempt
Statistical result:

| Correlation Prediction | True Positive Rate | False Positive Rate | Recall | F-Measure |
|---|---|---|---|---|
| Yes | 89.3% | 5.2% | 0.893 | 0.725 |
| No | 94.8% | 10.7% | 0.948 | 0.969 |

Confusion Matrix

| A | B | <- classified as |
|---|---|---|
| 25 | 3 | A = Yes |
| 16 | 294 | B = No |
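The reported rates can be recomputed from the confusion matrix with the standard definitions. The helper name `class_metrics` is illustrative; the four counts are the ones in the matrix.

```python
def class_metrics(tp, fn, fp, tn):
    """Per-class rates from confusion-matrix counts."""
    tpr = tp / (tp + fn)          # true positive rate, same as recall
    fpr = fp / (fp + tn)          # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"tpr": tpr, "fpr": fpr, "precision": precision,
            "f_measure": f_measure}


# Rows are actual classes, columns are predicted classes.
yes_as_yes, yes_as_no = 25, 3
no_as_yes, no_as_no = 16, 294

yes = class_metrics(tp=yes_as_yes, fn=yes_as_no, fp=no_as_yes, tn=no_as_no)
no = class_metrics(tp=no_as_no, fn=no_as_yes, fp=yes_as_no, tn=yes_as_yes)
print({k: round(v, 3) for k, v in yes.items()})
print({k: round(v, 3) for k, v in no.items()})
```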
![Page 29: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/29.jpg)
Future attempt: Query Normalization
Search volume tends to increase as the Internet becomes more popular
Histogram for the top 20 most popular keywords of all time:
![Page 30: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/30.jpg)
Future attempt: Normalization
Histograms need to be normalized to remove the effect of this trend!
Proposed action:
◦ Subtract the time effect
◦ Current problem: more distortion is added due to a scaling problem.
-> Histograms from Google have been scaled. We have no information about the raw data.
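The slides do not define how the time effect would be subtracted. One simple interpretation, shown here purely as an assumption, is to fit a least-squares straight line to the series and keep the residuals. It does not address the scaling problem noted above.

```python
def linear_detrend(series):
    """Subtract the least-squares straight line from the series."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


# Toy series: steady growth plus one spike at index 4.
growing = [2 * t + 1 for t in range(10)]
growing_with_spike = list(growing)
growing_with_spike[4] += 20

residuals = linear_detrend(growing_with_spike)
print([round(r, 2) for r in residuals])
```

After detrending, the spike stands out as the largest residual, while a pure linear trend reduces to zeros.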
![Page 31: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/31.jpg)
Future attempt: From Periodic to Non-periodic
Find the correlation between two non-periodic queries.
Proposed problem: some keywords are searched heavily after other keywords
◦ Example: “tsunami” is usually searched after “earthquake” is issued.
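A lagged correlation is one possible way to capture "searched after". The sketch below shifts the following series by 0 to `max_lag` steps and keeps the lag with the highest Pearson coefficient. The approach, the names, and the toy series are assumptions, not the method from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is flat."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(leading, following, max_lag=3):
    """Return (lag, coefficient) maximizing correlation of following vs leading."""
    n = len(leading)
    scored = [(lag, pearson(leading[:n - lag], following[lag:]))
              for lag in range(max_lag + 1)]
    return max(scored, key=lambda item: item[1])


# Toy volumes: "tsunami" repeats the "earthquake" pattern one step later.
earthquake = [0, 5, 0, 0, 0, 4, 0, 0, 0, 0]
tsunami = [0, 0, 5, 0, 0, 0, 4, 0, 0, 0]

lag, coef = best_lag(earthquake, tsunami)
print(lag, coef)
```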
![Page 32: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/32.jpg)
Future attempt: From Periodic to Non-periodic
Tsunami
Earthquake
![Page 33: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/33.jpg)
Potential Applications
Results re-ranking:
◦ Move results that are more up to date higher in the result list
Example: when a user asks for Beyonce during the time of the Grammy -> results related to Grammy will have a higher rank
Server buffering:
◦ When a user queries Beyonce, web pages related to Grammy are buffered on the local server, in the expectation that the user will eventually search for Grammy.
![Page 34: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/34.jpg)
Question?
![Page 35: Implementing Query Classification HYP: End of Semester Update prepared Minh](https://reader036.vdocument.in/reader036/viewer/2022062517/56649ecf5503460f94bdd03b/html5/thumbnails/35.jpg)
The End