Time-aware Approaches to Information Retrieval
Nattiya Kanhabua
Department of Computer and Information Science Norwegian University of Science and Technology
24 February 2012
Nattiya Kanhabua 2 PhD defense
Motivation
• Searching documents created/edited over time – E.g., web archives, news archives, blogs, or emails – collectively called "temporal document collections"
• Example: retrieve documents about Pope Benedict XVI written before 2005 – Term-based IR approaches may give unsatisfactory results
Wayback Machine¹
• A web archive search tool by the Internet Archive – Query by a URL, e.g., http://www.ntnu.no – No keyword query – No relevance ranking
¹Retrieved on 15 January 2011
Google News Archive Search
• A news archive search tool by Google – Query by keywords – Rank results by relevance or date
• Does not consider terminology changes over time
Objective of PhD thesis
• Study problems of temporal search
• Propose approaches to solve the problems
• Main research question: "How to exploit temporal information in documents, queries, and external sources in order to improve the retrieval effectiveness?"
Outline of contributions
Part I - Content Analysis RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis RQ2: How to determine time of queries? RQ3: How to handle terminology changes over time? RQ4: How to predict the effectiveness of temporal queries? RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models RQ6: How to model time into retrieval and ranking? RQ7: How to combine different features and time for ranking?
PART I - CONTENT ANALYSIS
RQ1: Determining time of documents
Problem statement
• Difficult to find a trustworthy time for web documents – Time gap between crawling and indexing – Decentralization and relocation of web documents – No standard metadata for time/date
"I found a bible-like document, but I have no idea when it was created."
"Let me see… This document was probably written in 850 A.D., with 95% confidence."
"For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?"
Preliminaries: Temporal Language Models [de Jong 2005]
• Based on the statistical usage of words over time
• Compare each word of a non-timestamped document with a reference corpus
• Tentative timestamp: the time partition that most overlaps with the document in word usage
Example reference corpus:
Partition | Word
1999 | tsunami
1999 | Japan
1999 | tidal wave
2004 | tsunami
2004 | Thailand
2004 | earthquake
For a non-timestamped document containing "tsunami" and "Thailand", the similarity scores are Score(1999) = 1 and Score(2004) = 1 + 1 = 2, so the most likely timestamp is 2004.
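The toy example can be sketched in code. This is a deliberately simplified overlap count; the actual temporal language models score partitions with a normalized log-likelihood ratio rather than raw word overlap:

```python
# Toy sketch of the temporal language model idea [de Jong 2005]: score a
# non-timestamped document against each time partition by counting
# overlapping words. The data below mirrors the slide's example.

# Reference corpus: words observed in each time partition
partitions = {
    "1999": {"tsunami", "japan", "tidal wave"},
    "2004": {"tsunami", "thailand", "earthquake"},
}

def date_document(doc_words):
    """Rank time partitions by word overlap with the document."""
    scores = {p: len(words & doc_words) for p, words in partitions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = date_document({"tsunami", "thailand"})
# ranked[0] is ("2004", 2): the most likely timestamp is 2004
```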
Improving document dating
Three enhancement techniques: 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights
Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2008.
The three techniques in detail:
1. Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy. Approach: Integrate semantic-based techniques into document preprocessing.
2. Intuition: Search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition. Approach: Linearly combine a GZ score with the normalized log-likelihood ratio.
3. Intuition: A term's weight should depend on how well the term separates time partitions, i.e., how discriminative it is. Approach: Propose temporal entropy, based on the term selection method of Lochbaum and Streeter.
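As an illustration of the third technique, a minimal sketch of a temporal-entropy-style weight, assuming a "1 minus normalized entropy over partitions" form (adapted from Lochbaum and Streeter's term selection; the thesis' exact normalization may differ):

```python
# Sketch of temporal entropy as a term weight (assumed form, see lead-in).
# A term concentrated in few partitions (discriminative) gets a weight
# near 1; a term spread evenly over all partitions gets a weight near 0.
import math

def temporal_entropy(tf_per_partition):
    """tf_per_partition: frequencies of one term across all partitions."""
    total = sum(tf_per_partition)
    n_p = len(tf_per_partition)
    probs = [tf / total for tf in tf_per_partition if tf > 0]
    entropy = -sum(p * math.log(p) for p in probs) / math.log(n_p)
    return 1 - entropy

w_spread = temporal_entropy([1, 1])   # appears evenly: weight 0.0
w_focused = temporal_entropy([2, 0])  # appears in one partition: weight 1.0
```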
Experiments
• Collection – 9,000 documents collected from the Internet Archive – An 8-year time span, 15 news sources – 1,000 randomly selected documents for testing
• Results – The proposed techniques improve over the baseline – Precision = the fraction of documents correctly dated
• Open issue – The effectiveness of document dating is still limited – It is highly dependent on the quality of the reference corpus
PART II - QUERY ANALYSIS
Challenges with temporal queries
• Semantic gaps: lacking knowledge about 1. the possibly relevant time of queries 2. terminology changes over time
(Diagram: for a query, suggest its relevant times time1, time2, …, timek)
(Diagram: for a query, suggest its time-based synonyms synonym@2001, synonym@2002, …, synonym@2011)
RQ2: Determining time of queries
Problem statements
• 1.5% of web queries explicitly contain a temporal expression [Nunes 2008] – Time is part of the query, e.g., "U.S. Presidential election 2008"
• About 7% of web queries have an implicit temporal intent [Metzler 2009] – Time is not given in the query, e.g., "Germany World Cup" or "tsunami" – Difficult to achieve high accuracy using only keywords – The particular time that relevant documents are associated with is not given
Our contributions
1. Determining the time of queries when no time is given
2. Re-ranking search results using the determined time
Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2010.
Determining time of queries
• Approach I: Dating using keywords*
• Approach II: Dating using top-k documents* – Queries are short keyword sets – Inspired by pseudo-relevance feedback
• Approach III: Using the timestamps of top-k documents – No temporal language models are used
*Using the Temporal Language Models proposed by de Jong et al.
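Approach III can be sketched as a simple frequency vote over the publication years of the top-k results (illustrative only; weighting years by retrieval score would be an equally plausible variant):

```python
# Sketch of Approach III: infer the time of a query from the publication
# timestamps of its top-k retrieved documents, in the spirit of
# pseudo-relevance feedback.
from collections import Counter

def query_time_from_topk(topk_pub_years, m=3):
    """Return the m most frequent publication years among the top-k docs."""
    return [year for year, _ in Counter(topk_pub_years).most_common(m)]

years = query_time_from_topk([2004, 2005, 2004, 2006, 2004], m=2)
# years[0] is 2004, the dominant publication year of the top-k documents
```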
Re-ranking search results
• Intuition: documents published close to the determined time of the query are more likely to be relevant – Assign document priors based on publication dates
(Diagram: the time of the query, e.g., 2005, 2004, 2006, …, is determined from a news archive; the initially retrieved results, led by D2009, are re-ranked so that D2005 moves to the top)
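A minimal sketch of re-ranking with a time-based document prior, assuming an exponential-decay prior over the distance between a document's publication year and the determined query time (the decay shape is an assumption, not the thesis' exact prior):

```python
# Hedged sketch of re-ranking with a time-based document prior.
# Assumption: an exponential-decay prior on the distance (in years)
# between a document's publication date and the determined query time.

def rerank(results, query_years, decay=0.5):
    """results: list of (doc_id, retrieval_score, publication_year)."""
    def prior(pub_year):
        # Documents published close to the query's time get a prior near 1
        dist = min(abs(pub_year - y) for y in query_years)
        return decay ** dist
    rescored = [(doc, score * prior(year), year) for doc, score, year in results]
    return sorted(rescored, key=lambda t: t[1], reverse=True)

results = [("D2009", 0.9, 2009), ("D2005", 0.8, 2005)]
top = rerank(results, {2005, 2004, 2006})[0][0]  # "D2005" now ranks first
```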
Experiments: Part 1 – Determining the time of queries
• Collection – NYT Corpus, over 1.8M articles (1987-2007) – 30 time-sensitive queries from TREC Robust2004
• Results (precision = the fraction of queries correctly dated) – The smaller the top-k, the better the precision (k=5 > k=10 > k=15) – The larger the granularity g, the better the precision (g=12-month > g=6-month)
Experiments: Part 2 – Re-ranking of search results
• Collections – TREC Robust2004, 30 time-sensitive queries – NYT Corpus, 24 queries from Google Zeitgeist
• Results – Approach III (no TLMs) outperforms all other approaches – Using publication dates is more accurate than the dating process
• Open issue – Time can improve the effectiveness further if query dating reaches a higher accuracy
RQ3: Handling terminology changes
Problem statements
• Queries composed of named entities (people, organizations, locations) – Highly dynamic in appearance, i.e., relationships between terms change over time – E.g., changes of roles, name alterations, or semantic shift
Scenario 1: Query "Pope Benedict XVI", written before 2005. Documents about "Joseph Alois Ratzinger" are relevant.
Scenario 2: Query "Hillary R. Clinton", written from 1997 to 2002. Documents about "New York Senator" and "First Lady of the United States" are relevant.
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Our contributions
• Discover time-based synonyms using Wikipedia – Generally, synonyms are words with similar meanings – This work refers to synonyms as alternative names of an entity
• Improve the accuracy of the time of synonyms
• Query expansion using time-based synonyms
Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
Recognize named entities
Find synonyms
• Find a set of entity-synonym relationships at time tk
• For each ei ∈ Etk, extract anchor texts from article links, e.g.: – Entity: President_of_the_United_States – Synonym: George W. Bush – Time: 11/2004
• Anchor texts linking to President_of_the_United_States include "George W. Bush", "President George W. Bush", and "President Bush (43)"
Initial results
• Time periods are not accurate – Note: the times of synonyms are the timestamps of Wikipedia articles (8 years)
Enhancement using NYT
• Analyze the NYT Corpus to discover more accurate time periods – 20-year time span (1987-2007)
• Use the burst detection algorithm [Kleinberg 2003] – Time periods of synonyms = burst intervals
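To convey the idea of burst intervals, here is a simplified thresholding stand-in; Kleinberg's actual algorithm fits a two-state automaton over the document stream rather than thresholding yearly counts:

```python
# Simplified stand-in for burst detection: take as the synonym's time
# period those years whose mention count exceeds a multiple of the mean.
# (Assumption: illustrative only; the thesis uses Kleinberg's automaton.)

def burst_years(counts_by_year, factor=2.0):
    """counts_by_year: {year: number of mentions of the synonym}."""
    mean = sum(counts_by_year.values()) / len(counts_by_year)
    return sorted(y for y, c in counts_by_year.items() if c > factor * mean)

counts = {1987: 1, 1988: 0, 2001: 40, 2002: 55, 2003: 3}
burst = burst_years(counts)  # [2001, 2002]: the synonym's burst interval
```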
Query expansion
1. A user enters an entity as a query
2. The system retrieves synonyms wrt. the query
3. The user selects synonyms to expand the query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
Experiments
Part 1 - Synonym detection
• Collection – The whole history of English Wikipedia • All pages and revisions from 03/2001 to 03/2008 • 85 monthly snapshots, about 2.8 terabytes
• Results – 500 randomly selected entity-synonym relationships for evaluation • Accuracy 51% over all types of entities • Accuracy 73% for people, organizations, and companies
Part 2 - Query expansion
• Collections – TREC Robust2004 Track (250 queries) – NewsLibrary.com, over 100M U.S. news articles (20 temporal queries)
• Results – Baseline: probabilistic model without query expansion – QE significantly improves the effectiveness over the baseline for both collections
• Open issue – Only the name changes of famous persons can be discovered
Query prediction problems
Two problems are addressed:
1. Performance prediction – Predict the retrieval effectiveness (e.g., precision, recall, MAP) that a query will achieve wrt. a ranking model
2. Ranking prediction – Predict the ranking model that is most suitable for a query, i.e., that maximizes precision, recall, or MAP
RQ4: Query performance prediction
Problem statement
• Predict the effectiveness (e.g., MAP) that a query will achieve, in advance of or during retrieval [Hauff 2010] – High MAP → "good"; low MAP → "poor"
Objective
• Apply query enhancement techniques to improve the overall performance – Query suggestion is applied for "poor" queries
• To the best of our knowledge, predicting the performance of temporal queries has never been done before
Discussion
• Contributions – First study of performance prediction for temporal queries – Propose 10 time-based pre-retrieval predictors, considering both text and time
• Experiment – Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Results – Time-based predictors outperform keyword-based predictors – Combined predictors outperform single predictors in most cases
• Open issues – Increase the number of queries – Consider time uncertainty
Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
RQ5: Time-aware ranking prediction
• Problem statement – Two time dimensions: publication time and content time • Content time = temporal expressions mentioned in documents – The effectiveness for temporal queries differs when ranking using publication time or content time
Discussion
• Contributions – First study of the impact of the two time dimensions on the effectiveness of ranking models – Three features from analyzing top-k documents: • Temporal KL-divergence [Diaz 2004] • Content clarity [Cronen-Townsend 2002] • Divergence of retrieval scores [Peng 2010]
• Results – A small number of top-k documents achieves better performance – The larger k is, the more irrelevant documents are introduced into the analysis
• Open issue – Compared with the optimal case, there is still room for improvement
Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction (under submission).
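Of the three features, temporal KL-divergence can be sketched as follows; the smoothing constant and year-level granularity are illustrative assumptions:

```python
# Hedged sketch of the temporal KL-divergence feature [Diaz 2004]:
# compare the distribution of publication years in the top-k results
# against that of the whole collection; a temporally focused result set
# yields a high divergence.
import math

def temporal_kl(topk_years, collection_years, eps=1e-9):
    years = set(collection_years)
    def dist(sample):
        # Smoothed empirical distribution of years in the sample
        return {y: (sample.count(y) + eps) / (len(sample) + eps * len(years))
                for y in years}
    p, q = dist(topk_years), dist(collection_years)
    return sum(p[y] * math.log(p[y] / q[y]) for y in years)

collection = [1999, 2000, 2004, 2005, 2007] * 4
focused = temporal_kl([2004] * 5, collection)   # peaked top-k
spread = temporal_kl([1999, 2000, 2004, 2005, 2007], collection)
# A focused top-k yields a much larger divergence than a spread one
```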
PART III - RETRIEVAL AND RANKING MODELS
RQ6: Time-aware ranking models
• Problem statements – Time must be explicitly modeled in order to increase the retrieval effectiveness – Time uncertainty should be taken into account • Two temporal expressions can refer to the same time period even though they are not written identically
• Example – Given the query "Independence Day 2011", a retrieval model relying on term matching will fail to retrieve documents mentioning "July 4, 2011"
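To illustrate uncertainty-aware temporal matching, here is a sketch that represents temporal expressions as day intervals and scores them by overlap; the Jaccard-style measure is an illustrative choice, not the exact TSU/LMT/LMTU formulation:

```python
# Sketch of uncertainty-aware temporal matching: represent each temporal
# expression as a closed (begin, end) interval of days and score pairs by
# interval overlap, so "Independence Day 2011" can match "July 4, 2011"
# even though the two expressions share no terms.
from datetime import date

def overlap(a, b):
    """Jaccard-style overlap of two closed day intervals (begin, end)."""
    inter_days = (min(a[1], b[1]) - max(a[0], b[0])).days + 1
    union_days = (max(a[1], b[1]) - min(a[0], b[0])).days + 1
    return max(0, inter_days) / union_days

q = (date(2011, 7, 4), date(2011, 7, 4))  # "Independence Day 2011", resolved
d = (date(2011, 7, 4), date(2011, 7, 4))  # "July 4, 2011"
s = overlap(q, d)  # 1.0: perfect temporal match despite no shared terms
```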
Discussion
• Contributions – Analyze and compare five time-aware ranking methods
• Experiment – Collection: NYT Corpus and 40 temporal queries [Berberich 2010]
• Result – TSU significantly outperforms the other methods on most metrics
• Conclusions – Although TSU gains the best performance, it cannot be applied to a document collection without time metadata – LMT and LMTU can be applied to any collection without time metadata, but extraction of temporal expressions is needed
Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
RQ7: Ranking related news predictions
• Problem statement – Can the combination of time and other features help improve the retrieval effectiveness?
• A new task called ranking related news predictions – Retrieve predictions related to a news story from news archives – Rank them according to their relevance to the news story
Related news predictions
Contributions
• Define the task of ranking related news predictions – Searching the future was proposed in [Baeza-Yates 2005]
• Propose four classes of features – Term similarity, entity-based similarity, topic similarity, and temporal similarity
• Rank predictions using learning-to-rank [Liu 2009]
• Make available a dataset with over 6,000 judgments
Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
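The learning-to-rank combination can be illustrated with a pointwise linear scorer over the four feature classes; the weights and feature values below are invented for illustration, not learned values from the thesis:

```python
# Illustration of combining the four feature classes with a pointwise
# linear scorer, standing in for the learning-to-rank model [Liu 2009].

def score(features, weights):
    """Weighted sum over the four feature classes."""
    return sum(weights[k] * features[k] for k in weights)

# Hypothetical learned weights and feature values for one prediction
weights = {"term": 0.2, "entity": 0.1, "topic": 0.5, "temporal": 0.2}
prediction = {"term": 0.4, "entity": 0.3, "topic": 0.9, "temporal": 0.7}
s = score(prediction, weights)  # single relevance score used for ranking
```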
Experiments
• NYT Corpus – More than 25% of articles contain at least one prediction
• Feature analysis – Topic features play an important role in ranking – Entity-based features receive the lowest weights among the top-5 features
• Open issues – Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc. – Sentiment analysis for future-related information
Conclusions
Solutions to all research questions:
Part I - Content Analysis RQ1: How to determine time of non-timestamped documents?
Part II - Query Analysis RQ2: How to determine time of queries? RQ3: How to handle terminology changes over time? RQ4: How to predict the effectiveness of temporal queries? RQ5: How to predict the suitable time-aware ranking?
Part III - Retrieval and Ranking Models RQ6: How to model time into retrieval and ranking? RQ7: How to combine different features and time for ranking?
• Nattiya Kanhabua and Kjetil Nørvåg, Improving Temporal Language Models For Determining Time of Non-Timestamped Documents, In Proceedings of European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2008.
• Nattiya Kanhabua and Kjetil Nørvåg, Using temporal language models for document dating, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2009
• Nattiya Kanhabua and Kjetil Nørvåg, Determining Time of Queries for Re-ranking Search Results, In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Exploiting Time-based Synonyms in Searching Document Archives, In Proceedings of the ACM/IEEE Conference on Digital Libraries (JCDL), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, QUEST: query expansion using synonyms over time, In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010.
• Nattiya Kanhabua and Kjetil Nørvåg, Time-based Query Performance Predictors (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua and Kjetil Nørvåg, A Comparison of Time-aware Ranking Methods (poster), In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Roi Blanco and Michael Matthews, Ranking Related News Predictions, In Proceedings of the 34th Annual ACM SIGIR Conference (SIGIR), 2011.
• Nattiya Kanhabua, Klaus Berberich and Kjetil Nørvåg, Time-aware Ranking Prediction, Technical Report.
Publications
• [Baeza-Yates 2005] R. A. Baeza-Yates. Searching the future. In Proceedings of SIGIR workshop on mathematical/formal methods in information retrieval MF/IR, SIGIR ’05, 2005.
• [Berberich 2010] K. Berberich, S. J. Bedathur, O. Alonso, and G. Weikum. A language modeling approach for temporal information needs. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 13-25, 2010.
• [Cronen-Townsend 2002] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’02, pp. 299-306, 2002.
• [Diaz 2004] F. Diaz and R. Jones. Using temporal profiles of queries for precision prediction. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’04, pp. 18-24, 2004.
• [Hauff 2010] C. Hauff, L. Azzopardi, D. Hiemstra, and F. de Jong. Query performance prediction: Evaluation contrasted with effectiveness. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 204-216, April 2010.
• [de Jong 2005] F. de Jong, H. Rode, and D. Hiemstra. Temporal language models for the disclosure of historical text. In Humanities, computers and cultural heritage: Proceedings of the 16th International Conference of the Association for History and Computing, AHC '05, pp. 161-168, 2005.
• [Kleinberg 2003] J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7:373-397, October 2003.
• [Liu 2009] T-Y. Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225-331, March 2009.
• [Metzler 2009] D. Metzler, R. Jones, F. Peng, and R. Zhang. Improving search relevance for implicitly temporal queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pp. 700-701, 2009.
• [Nunes 2008] S. Nunes, C. Ribeiro, and G. David. Use of temporal expressions in web search. In Proceedings of the 30th European Conference on IR Research on Advances in Information Retrieval, ECIR ’08, pp. 580-584, 2008.
• [Peng 2010] J. Peng, C. Macdonald, and I. Ounis. Learning to select a ranking function. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval, ECIR ’10, pp. 114-126, 2010.
References
Thank you