Time-aware Approaches to Information Retrieval
Nattiya Kanhabua
Department of Computer and Information Science Norwegian University of Science and Technology
24 February 2012
Nattiya Kanhabua 2 PhD defense
Motivation
• Searching documents created/edited over time – E.g., web archives, news archives, blogs, or emails – collectively called "temporal document collections"
• Example: retrieve documents about Pope Benedict XVI written before 2005 – Term-based IR approaches may give unsatisfactory results
Wayback Machine¹
• A web archive search tool by the Internet Archive – Query by a URL, e.g., http://www.ntnu.no – No keyword query – No relevance ranking
¹Retrieved on 15 January 2011
Google News Archive Search
• A news archive search tool by Google – Query by keywords – Rank results by relevance or date
• Does not consider terminology changes over time
Objective of PhD thesis
• Study problems of temporal search
• Propose approaches to solve the problems
• Main research question: "How to exploit temporal information in documents, queries, and external sources in order to improve the retrieval effectiveness?"
Outline of contributions
Part I - Content Analysis RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis RQ2: How to determine time of queries? RQ3: How to handle terminology changes over time? RQ4: How to predict the effectiveness of temporal queries? RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models RQ6: How to model time into retrieval and ranking? RQ7: How to combine different features and time for ranking?
PART I - CONTENT ANALYSIS
RQ1: Determining time of documents
Problem statement
• Difficult to find a trustworthy time for web documents – Time gap between crawling and indexing – Decentralization and relocation of web documents – No standard metadata for time/date
"I found a bible-like document, but I have no idea when it was created."
"Let me see… This document was probably written in 850 A.D., with 95% confidence."
"For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?"
Preliminaries: Temporal Language Models [de Jong 2005]
• Based on the statistical usage of words over time
• Compare each word of a non-timestamped document with a reference corpus
• Tentative timestamp: the time partition that most overlaps with the document in word usage
Example reference corpus:
Partition | Word
1999 | tsunami
1999 | Japan
1999 | tidal wave
2004 | tsunami
2004 | Thailand
2004 | earthquake
For a non-timestamped document containing "tsunami" and "Thailand", the similarity scores are Score(1999) = 1 and Score(2004) = 1 + 1 = 2, so the most likely timestamp is 2004.
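The toy example can be sketched in code. This is a deliberately simplified overlap count; the actual temporal language models score partitions with a normalized log-likelihood ratio rather than raw word overlap:

```python
# Toy sketch of the temporal language model idea [de Jong 2005]: score a
# non-timestamped document against each time partition by counting
# overlapping words. The data below mirrors the slide's example.

# Reference corpus: words observed in each time partition
partitions = {
    "1999": {"tsunami", "japan", "tidal wave"},
    "2004": {"tsunami", "thailand", "earthquake"},
}

def date_document(doc_words):
    """Rank time partitions by word overlap with the document."""
    scores = {p: len(words & doc_words) for p, words in partitions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = date_document({"tsunami", "thailand"})
# ranked[0] is ("2004", 2): the most likely timestamp is 2004
```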
Improving document dating
Three enhancement techniques: 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights
Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2008.
The three techniques in detail:
1. Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy. Approach: Integrate semantic-based techniques into document preprocessing.
2. Intuition: Search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition. Approach: Linearly combine a GZ score with the normalized log-likelihood ratio.
3. Intuition: A term's weight should depend on how well the term separates time partitions, i.e., how discriminative it is. Approach: Propose temporal entropy, based on the term selection method of Lochbaum and Streeter.
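As an illustration of the third technique, a minimal sketch of a temporal-entropy-style weight, assuming a "1 minus normalized entropy over partitions" form (adapted from Lochbaum and Streeter's term selection; the thesis' exact normalization may differ):

```python
# Sketch of temporal entropy as a term weight (assumed form, see lead-in).
# A term concentrated in few partitions (discriminative) gets a weight
# near 1; a term spread evenly over all partitions gets a weight near 0.
import math

def temporal_entropy(tf_per_partition):
    """tf_per_partition: frequencies of one term across all partitions."""
    total = sum(tf_per_partition)
    n_p = len(tf_per_partition)
    probs = [tf / total for tf in tf_per_partition if tf > 0]
    entropy = -sum(p * math.log(p) for p in probs) / math.log(n_p)
    return 1 - entropy

w_spread = temporal_entropy([1, 1])   # appears evenly: weight 0.0
w_focused = temporal_entropy([2, 0])  # appears in one partition: weight 1.0
```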
Experiments
• Collection – 9,000 documents collected from the Internet Archive – An 8-year time span, 15 news sources – 1,000 randomly selected documents for testing
• Results – The proposed techniques improve over the baseline – Precision = the fraction of documents correctly dated
• Open issue – The effectiveness of document dating is still limited – It is highly dependent on the quality of the reference corpus
PART II - QUERY ANALYSIS
Challenges with temporal queries
• Semantic gaps: lacking knowledge about 1. the possibly relevant time of queries 2. terminology changes over time
(Diagram: for a query, suggest its relevant times time1, time2, …, timek)
(Diagram: for a query, suggest its time-based synonyms synonym@2001, synonym@2002, …, synonym@2011)
RQ2: Determining time of queries
Problem statements
• 1.5% of web queries explicitly contain a temporal expression [Nunes 2008] – Time is part of the query, e.g., "U.S. Presidential election 2008"
• About 7% of web queries have an implicit temporal intent [Metzler 2009] – Time is not given in the query, e.g., "Germany World Cup" or "tsunami" – Difficult to achieve high accuracy using only keywords – The particular time that relevant documents are associated with is not given
Our contributions
1. Determining the time of queries when no time is given
2. Re-ranking search results using the determined time
Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2010.
Determining time of queries
• Approach I: Dating using keywords*
• Approach II: Dating using top-k documents* – Queries are short keyword sets – Inspired by pseudo-relevance feedback
• Approach III: Using the timestamps of top-k documents – No temporal language models are used
*Using the Temporal Language Models proposed by de Jong et al.
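Approach III can be sketched as a simple frequency vote over the publication years of the top-k results (illustrative only; weighting years by retrieval score would be an equally plausible variant):

```python
# Sketch of Approach III: infer the time of a query from the publication
# timestamps of its top-k retrieved documents, in the spirit of
# pseudo-relevance feedback.
from collections import Counter

def query_time_from_topk(topk_pub_years, m=3):
    """Return the m most frequent publication years among the top-k docs."""
    return [year for year, _ in Counter(topk_pub_years).most_common(m)]

years = query_time_from_topk([2004, 2005, 2004, 2006, 2004], m=2)
# years[0] is 2004, the dominant publication year of the top-k documents
```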
Re-ranking search results
• Intuition: documents published close to the determined time of the query are more likely to be relevant – Assign document priors based on publication dates
(Diagram: the time of the query, e.g., 2005, 2004, 2006, …, is determined from a news archive; the initially retrieved results, led by D2009, are re-ranked so that D2005 moves to the top)
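A minimal sketch of re-ranking with a time-based document prior, assuming an exponential-decay prior over the distance between a document's publication year and the determined query time (the decay shape is an assumption, not the thesis' exact prior):

```python
# Hedged sketch of re-ranking with a time-based document prior.
# Assumption: an exponential-decay prior on the distance (in years)
# between a document's publication date and the determined query time.

def rerank(results, query_years, decay=0.5):
    """results: list of (doc_id, retrieval_score, publication_year)."""
    def prior(pub_year):
        # Documents published close to the query's time get a prior near 1
        dist = min(abs(pub_year - y) for y in query_years)
        return decay ** dist
    rescored = [(doc, score * prior(year), year) for doc, score, year in results]
    return sorted(rescored, key=lambda t: t[1], reverse=True)

results = [("D2009", 0.9, 2009), ("D2005", 0.8, 2005)]
top = rerank(results, {2005, 2004, 2006})[0][0]  # "D2005" now ranks first
```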
Experiments: Part 1 – Determining the time of queries
• Collection – NYT Corpus, over 1.8M articles (1987-2007) – 30 time-sensitive queries from TREC Robust2004
• Results (precision = the fraction of queries correctly dated) – The smaller the top-k, the better the precision (k=5 > k=10 > k=15) – The larger the granularity g, the better the precision (g=12-month > g=6-month)
Experiments: Part 2 – Re-ranking of search results
• Collections – TREC Robust2004, 30 time-sensitive queries – NYT Corpus, 24 queries from Google Zeitgeist
• Results – Approach III (no TLMs) outperforms all other approaches – Using publication dates is more accurate than the dating process
• Open issue – Time can improve the effectiveness further if query dating reaches a higher accuracy
RQ3: Handling terminology changes
Problem statements
• Queries composed of named entities (people, organizations, locations) – Highly dynamic in appearance, i.e., relationships between terms change over time – E.g., changes of roles, name alterations, or semantic shift
Scenario 1: Query "Pope Benedict XVI", written before 2005. Documents about "Joseph Alois Ratzinger" are relevant.
Scenario 2: Query "Hillary R. Clinton", written from 1997 to 2002. Documents about "New York Senator" and "First Lady of the United States" are relevant.
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Our contributions
• Discover time-based synonyms using Wikipedia – Generally, synonyms are words with similar meanings – This work refers to synonyms as alternative names of an entity
• Improve the accuracy of the time of synonyms
• Query expansion using time-based synonyms
Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
Recognize named entities
Find synonyms
• Find a set of entity-synonym relationships at time tk
• For each ei ∈ Etk, extract anchor texts from article links, e.g.: – Entity: President_of_the_United_States – Synonym: George W. Bush – Time: 11/2004
• Anchor texts linking to President_of_the_United_States include "George W. Bush", "President George W. Bush", and "President Bush (43)"
Initial results
• Time periods are not accurate – Note: the times of synonyms are the timestamps of Wikipedia articles (8 years)
Enhancement using NYT
• Analyze the NYT Corpus to discover more accurate time periods – 20-year time span (1987-2007)
• Use the burst detection algorithm [Kleinberg 2003] – Time periods of synonyms = burst intervals
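To convey the idea of burst intervals, here is a simplified thresholding stand-in; Kleinberg's actual algorithm fits a two-state automaton over the document stream rather than thresholding yearly counts:

```python
# Simplified stand-in for burst detection: take as the synonym's time
# period those years whose mention count exceeds a multiple of the mean.
# (Assumption: illustrative only; the thesis uses Kleinberg's automaton.)

def burst_years(counts_by_year, factor=2.0):
    """counts_by_year: {year: number of mentions of the synonym}."""
    mean = sum(counts_by_year.values()) / len(counts_by_year)
    return sorted(y for y, c in counts_by_year.items() if c > factor * mean)

counts = {1987: 1, 1988: 0, 2001: 40, 2002: 55, 2003: 3}
burst = burst_years(counts)  # [2001, 2002]: the synonym's burst interval
```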
Query expansion
1. A user enters an entity as a query
2. The system retrieves synonyms wrt. the query
3. The user selects synonyms to expand the query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Experiments
Part 1 - Synonym detection
• Collection – The whole history of English Wikipedia • All pages and revisions from 03/2001 to 03/2008 • 85 monthly snapshots, about 2.8 terabytes
• Results – 500 randomly selected entity-synonym relationships for evaluation • Accuracy 51% over all types of entities • Accuracy 73% for people, organizations, and companies
Part 2 - Query expansion
• Collections – TREC Robust2004 Track (250 queries) – NewsLibrary.com, over 100M U.S. news articles (20 temporal queries)
• Results – Baseline: probabilistic model without query expansion – QE significantly improves the effectiveness over the baseline for both collections
• Open issue – Only the name changes of famous persons can be discovered
Query prediction problems
Two problems are addressed:
1. Performance prediction – Predict the retrieval effectiveness (e.g., precision, recall, MAP) that a query will achieve wrt. a ranking model
2. Ranking prediction – Predict the ranking model that is most suitable for a query, i.e., that maximizes precision, recall, or MAP
RQ4: Query performance prediction
Problem statement
• Predict the effectiveness (e.g., MAP) that a query will achieve, in advance of or during retrieval [Hauff 2010] – High MAP → "good"; low MAP → "poor"
Objective
• Apply query enhancement techniques to improve the overall performance – Query suggestion is applied for "poor" queries
• To the best of our knowledge, predicting the performance of temporal queries has never been done before
Discussion
• Contributions – First study of performance prediction for temporal queries – Propose 10 time-based pre-retrieval predictors, considering both text and time
• Experiment – Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Results – Time-based predictors outperform keyword-based predictors – Combined predictors outperform single predictors in most cases
• Open issues – Increase the number of queries – Consider time uncertainty
Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
RQ5: Time-aware ranking prediction
• Problem statement – Two time dimensions: publication time and content time • Content time = temporal expressions mentioned in documents – The effectiveness for temporal queries differs when ranking using publication time or content time
Discussion
• Contributions – First study of the impact of the two time dimensions on the effectiveness of ranking models – Three features from analyzing top-k documents: • Temporal KL-divergence [Diaz 2004] • Content clarity [Cronen-Townsend 2002] • Divergence of retrieval scores [Peng 2010]
• Results – A small number of top-k documents achieves better performance – The larger k is, the more irrelevant documents are introduced into the analysis
• Open issue – Compared with the optimal case, there is still room for improvement
Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction (under submission).
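Of the three features, temporal KL-divergence can be sketched as follows; the smoothing constant and year-level granularity are illustrative assumptions:

```python
# Hedged sketch of the temporal KL-divergence feature [Diaz 2004]:
# compare the distribution of publication years in the top-k results
# against that of the whole collection; a temporally focused result set
# yields a high divergence.
import math

def temporal_kl(topk_years, collection_years, eps=1e-9):
    years = set(collection_years)
    def dist(sample):
        # Smoothed empirical distribution of years in the sample
        return {y: (sample.count(y) + eps) / (len(sample) + eps * len(years))
                for y in years}
    p, q = dist(topk_years), dist(collection_years)
    return sum(p[y] * math.log(p[y] / q[y]) for y in years)

collection = [1999, 2000, 2004, 2005, 2007] * 4
focused = temporal_kl([2004] * 5, collection)   # peaked top-k
spread = temporal_kl([1999, 2000, 2004, 2005, 2007], collection)
# A focused top-k yields a much larger divergence than a spread one
```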
PART III - RETRIEVAL AND RANKING MODELS
RQ6: Time-aware ranking models
• Problem statements – Time must be explicitly modeled in order to increase the retrieval effectiveness – Time uncertainty should be taken into account • Two temporal expressions can refer to the same time period even though they are not written identically
• Example – Given the query "Independence Day 2011", a retrieval model relying on term matching will fail to retrieve documents mentioning "July 4, 2011"
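To illustrate uncertainty-aware temporal matching, here is a sketch that represents temporal expressions as day intervals and scores them by overlap; the Jaccard-style measure is an illustrative choice, not the exact TSU/LMT/LMTU formulation:

```python
# Sketch of uncertainty-aware temporal matching: represent each temporal
# expression as a closed (begin, end) interval of days and score pairs by
# interval overlap, so "Independence Day 2011" can match "July 4, 2011"
# even though the two expressions share no terms.
from datetime import date

def overlap(a, b):
    """Jaccard-style overlap of two closed day intervals (begin, end)."""
    inter_days = (min(a[1], b[1]) - max(a[0], b[0])).days + 1
    union_days = (max(a[1], b[1]) - min(a[0], b[0])).days + 1
    return max(0, inter_days) / union_days

q = (date(2011, 7, 4), date(2011, 7, 4))  # "Independence Day 2011", resolved
d = (date(2011, 7, 4), date(2011, 7, 4))  # "July 4, 2011"
s = overlap(q, d)  # 1.0: perfect temporal match despite no shared terms
```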
Discussion
• Contributions – Analyze and compare five time-aware ranking methods
• Experiment – Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Result – TSU significantly outperforms the other methods on most metrics
• Conclusions – Although TSU gains the best performance, it cannot be applied to a document collection without time metadata – LMT and LMTU can be applied to any collection without time metadata, but extraction of temporal expressions is needed
Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
RQ7: Ranking related news predictions
• Problem statement – Can the combination of time and other features help improve the retrieval effectiveness?
• A new task called ranking related news predictions – Retrieve predictions related to a news story from news archives – Rank them according to their relevance to the news story
Related news predictions
Contributions
• Define the task of ranking related news predictions – Searching the future was proposed in [Baeza-Yates 2005]
• Propose four classes of features – Term similarity, entity-based similarity, topic similarity, and temporal similarity
• Rank predictions using learning-to-rank [Liu 2009]
• Make available a dataset with over 6,000 judgments
Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
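The learning-to-rank combination can be illustrated with a pointwise linear scorer over the four feature classes; the weights and feature values below are invented for illustration, not learned values from the thesis:

```python
# Illustration of combining the four feature classes with a pointwise
# linear scorer, standing in for the learning-to-rank model [Liu 2009].

def score(features, weights):
    """Weighted sum over the four feature classes."""
    return sum(weights[k] * features[k] for k in weights)

# Hypothetical learned weights and feature values for one prediction
weights = {"term": 0.2, "entity": 0.1, "topic": 0.5, "temporal": 0.2}
prediction = {"term": 0.4, "entity": 0.3, "topic": 0.9, "temporal": 0.7}
s = score(prediction, weights)  # single relevance score used for ranking
```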
Experiments
• NYT Corpus – More than 25% of articles contain at least one prediction
• Feature analysis – Topic features play an important role in ranking – Entity-based features receive the lowest weights among the top-5 features
• Open issues – Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc. – Sentiment analysis for future-related information
Conclusions
Solutions to all research questions:
Part I - Content Analysis RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis RQ2: How to determine time of queries? RQ3: How to handle terminology changes over time? RQ4: How to predict the effectiveness of temporal queries? RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models RQ6: How to model time into retrieval and ranking? RQ7: How to combine different features and time for ranking?
• Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2008.
• Nattiya Kanhabua and Kjetil Nørvåg, Using temporal language models for document dating, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2009
• Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, QUEST: query expansion using synonyms over time, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, Technical Report.
Publications
• [Baeza-Yates 2005] R. A. Baeza-Yates. Searching the future. In Proceedings of SIGIR workshop on mathematical/formal methods in information retrieval MF/IR, SIGIR ’05, 2005.
• [Berberich 2010] K. Berberich, S. J. Bedathur, O. Alonso, and G. Weikum. A language modeling approach for temporal information needs. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 13-25, 2010.
• [Cronen-Townsend 2002] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’02, pp. 299-306, 2002.
• [Diaz 2004] F. Diaz and R. Jones. Using temporal profiles of queries for precision prediction. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’04, pp. 18-24, 2004.
• [Hauff 2010] C. Hauff, L. Azzopardi, D. Hiemstra, and F. de Jong. Query performance prediction: Evaluation contrasted with effectiveness. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 204-216, April 2010.
• [de Jong 2005] F. de Jong, H. Rode, and D. Hiemstra. Temporal language models for the disclosure of historical text. In Humanities, computers and cultural heritage: Proceedings of the 16th International Conference of the Association for History and Computing, AHC '05, pp. 161-168, 2005.
• [Kleinberg 2003] J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7:373-397, October 2003.
• [Liu 2009] T-Y. Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225-331, March 2009.
• [Metzler 2009] D. Metzler, R. Jones, F. Peng, and R. Zhang. Improving search relevance for implicitly temporal queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pp. 700-701, 2009.
• [Nunes 2008] S. Nunes, C. Ribeiro, and G. David. Use of temporal expressions in web search. In Proceedings of the 30th European Conference on IR Research on Advances in Information Retrieval, ECIR ’08, pp. 580-584, 2008.
• [Peng 2010] J. Peng, C. Macdonald, and I. Ounis. Learning to select a ranking function. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 114-126, 2010.
References
Thank you