![Page 1: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/1.jpg)
Information Retrieval with Time Series Query
Hyun Duk Kim (now at Twitter) , Danila Nikitin (now at Google), ChengXiang Zhai
University of Illinois at Urbana-Champaign
Malu Castellanos, Meichun Hsu
HP Laboratories
![Page 2: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/2.jpg)
IR for stock market analysis?
[Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance]
Any clues in the companion news stream?
What might have caused the stock market crash?
Sept 11 Attack!
What documents to read to analyze such a “causal” topic?
![Page 3: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/3.jpg)
Analysis of Presidential Prediction Markets
What might have caused the sudden drop of price for this candidate?
What “mattered” in this election?
[Figure: price over time]
Any clues in the companion news stream?
Tax cut?
What documents to read to analyze such a “causal” topic?
![Page 4: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/4.jpg)
Analysis of Product Sales
[Figure: product sales over time]
Any clues in the companion product reviews?
What might have caused the decrease of sales?
safety concerns
What reviews to read to analyze such a “causal” topic?
![Page 5: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/5.jpg)
Finding documents about “trendy” topics
[Figure: query time series over time]
Which documents cover such a “trendy” topic?
Draw a “time series query”: Find documents about a topic emerging this summer, which has attracted much attention this Oct
![Page 6: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/6.jpg)
Information Retrieval with Time Series Query
• Instead of a keyword query, use a time series as the query
– Retrieve documents that contain topics correlated with the query time series
• Input:
– Time series data with time stamps
– Text stream: a collection of time-stamped documents from the same time period
• Output:
– Ranked list of documents
![Page 7: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/7.jpg)
Ideal Results of Information Retrieval with Time Series Query
[Figure: Apple stock price ($) by date, 7/3/2000 to 12/31/2001 (price axis 0 to 70), shown alongside the 2000 to 2001 news stream]

| Rank | Date | Excerpt |
|---|---|---|
| 1 | 9/29/2000 | Expect earning will be far below |
| 2 | 12/8/2000 | $4 billion cash in company |
| 3 | 10/19/2000 | Disappointing earning report |
| 4 | 4/19/2001 | Dow and Nasdaq soar after rate cut by Federal Reserve |
| 5 | 7/20/2001 | Apple's new retail store |
| … | … | … |
![Page 8: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/8.jpg)
IR w/ TS - Method Overview
[Diagram: Input 1 is the text stream of input documents (Sep 2001, Oct 2001, …), from which the vocabulary and word frequency curves (W1, W2, W3, W4, …) are built. Input 2 is the non-text time series. Words are ranked by correlation with the time series. Output: ranked documents.]
![Page 9: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/9.jpg)
IR w/ TS - Method Overview
[Diagram repeated from the previous slide: text stream (Input 1) → vocabulary and word frequency curves; non-text time series (Input 2); rank by correlation → ranked documents (Output). Annotated with two questions:]
1. How to measure correlation between a word and the time series
2. How to aggregate word correlations to rank documents
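The first step in the overview, turning the text stream into per-word frequency curves, can be sketched as follows. This is a minimal illustration only: the function name `word_frequency_curves`, the whitespace tokenizer, and the use of raw daily counts are assumptions, not details given in the slides.

```python
from collections import Counter, defaultdict
from datetime import date


def word_frequency_curves(docs, time_points):
    """Build one frequency curve per word.

    docs: iterable of (date, text) pairs (the time-stamped text stream).
    time_points: sorted list of dates shared with the query time series.
    Returns {word: [count at each time point]}.
    """
    index = {t: i for i, t in enumerate(time_points)}
    curves = defaultdict(lambda: [0] * len(time_points))
    for day, text in docs:
        if day not in index:
            continue  # skip documents outside the query's time period
        # Assumption: lowercase whitespace tokenization, raw counts.
        for word, n in Counter(text.lower().split()).items():
            curves[word][index[day]] += n
    return dict(curves)


days = [date(2001, 9, 11), date(2001, 9, 12)]
stream = [(days[0], "attack attack market"), (days[1], "market")]
curves = word_frequency_curves(stream, days)
```

Each resulting curve has the same length as the query time series, so any pairwise similarity function can be applied to it directly.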
![Page 10: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/10.jpg)
Correlation Function
• Measure correlation between word frequency curve vs. input time series
1. Pearson Correlation
– Basic correlation
2. Dynamic Time Warping [Senin '08]
– Capture alignment of shifted or stretched time series
[Figure: two panels plotting values over time: the series before alignment, and the time series alignment]
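The two correlation functions named above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: `pearson` and `dtw_distance` are hypothetical names, the DTW variant is the textbook dynamic program with absolute difference as local cost, and the slides do not say how the DTW distance is turned into a score (here, a smaller distance simply means a closer match).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (sx * sy)


def dtw_distance(x, y):
    """Classic dynamic time warping distance (assumed local cost: |a - b|)."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

For example, `dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])` is zero because the second series is only a shifted copy of the first, which is the kind of shift a point-by-point Pearson comparison cannot absorb. In practice the series would probably need to be normalized to a common scale before computing DTW; that step is omitted here.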
![Page 11: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/11.jpg)
Aggregation Function
• Score document correlation by aggregating word correlations
1. Weighted TF-IDF (BM25)
– Use top K correlated words as a text query
– Use IR formula such as BM25
– Use correlation coefficient as a weight
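One way to read the weighted BM25 idea is to score a document with the usual per-term BM25 contribution, multiplied by that term's correlation with the query time series. The sketch below assumes a common BM25 variant (Lucene-style IDF, `k1=1.2`, `b=0.75`); the exact formula and parameters used in the talk are not given, and `weighted_bm25` is a hypothetical name.

```python
import math
from collections import Counter


def weighted_bm25(doc_tokens, corpus, term_weights, k1=1.2, b=0.75):
    """Score one document against the top-K correlated words.

    corpus: list of token lists (all documents in the text stream).
    term_weights: {term: correlation weight} for the top-K correlated words.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term, weight in term_weights.items():
        f = tf.get(term, 0)
        if f == 0:
            continue
        df = sum(1 for d in corpus if term in d)
        # Assumed IDF variant: always non-negative.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        score += weight * idf * norm  # correlation acts as the query term weight
    return score


corpus = [["taliban", "war"], ["apple", "store"], ["market", "news"]]
weights = {"taliban": 0.84, "war": 0.5}
scores = [weighted_bm25(d, corpus, weights) for d in corpus]
```

Because the correlation enters as a linear weight, a document matching only weakly correlated words is scored lower than one matching the strongest terms, all else being equal.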
![Page 12: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/12.jpg)
Aggregation Function
2. Average Correlation
a) Average over all terms:
– Not all the words are correlated?
b) Average over top-k terms:
– May be dominated by multiple occurrences of the same term
c) Average over top-k unique terms:
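The three averaging variants can be compared with one small function. This is a sketch under assumptions not stated on the slide: it averages absolute correlations, ignores tokens that have no correlation value, and uses the hypothetical name `average_correlation`.

```python
def average_correlation(doc_tokens, corr, k=None, unique=False):
    """Average the correlations of a document's terms.

    corr: {term: correlation with the query time series}.
    k=None averages over all terms (variant a); k=N keeps only the N
    highest values (variant b); unique=True counts each term once (variant c).
    """
    tokens = [t for t in doc_tokens if t in corr]
    if unique:
        tokens = list(set(tokens))
    values = sorted((abs(corr[t]) for t in tokens), reverse=True)
    if k is not None:
        values = values[:k]
    return sum(values) / len(values) if values else 0.0


corr = {"taliban": 0.8, "war": 0.6, "the": 0.1}
doc = ["taliban", "taliban", "taliban", "war", "the"]
all_terms = average_correlation(doc, corr)                  # variant a
top2 = average_correlation(doc, corr, k=2)                  # variant b
top2_uniq = average_correlation(doc, corr, k=2, unique=True)  # variant c
```

In this toy document the repeated word fills both top-2 slots in variant (b), which is the domination effect noted above; variant (c) lets the second-best distinct term contribute instead.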
![Page 13: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/13.jpg)
Evaluation
• Data Set
– New York Times corpus (Jul 2000~Dec 2001)
  • Entity annotated
– Daily stock prices of 24 companies
• Measure
– Mean average precision (MAP)
– Normalized discounted cumulative gain (NDCG)
• Research questions
1. Can our method retrieve meaningful documents?
2. Does DTW outperform Pearson Correlation?
3. Which aggregation function works the best?
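For readers unfamiliar with the two measures, here is a minimal sketch of how they are commonly computed for a single query. The function names are hypothetical, and the NDCG variant (log2 discount, ideal ranking built from all judged gains) is one standard choice; the slides do not specify which variant was used.

```python
import math


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant IDs."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # precision at each relevant hit
    return total / len(relevant) if relevant else 0.0


def ndcg(ranked, gains):
    """NDCG of one ranked list; gains maps doc ID to graded relevance."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked, start=1))
    ideal = sorted(gains.values(), reverse=True)
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

MAP is then the mean of `average_precision` over all queries, which in this setting would mean one query per company stock price series.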
![Page 14: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/14.jpg)
Top ranked documents by American Airlines stock price
| Rank | Date | Excerpt |
|---|---|---|
| 1 | 10/22/2001 | Fleeing the war |
| 2 | 12/11/2001 | US and anti-Taliban forces in Afghanistan |
| 3 | 11/18/2001 | Fate of Taliban Soldiers Under Discussion |
| 4 | 11/12/2001 | Tally of dead and missing in Sep 11 terrorist attacks |
| 5 | 9/25/2001 | Soldiers in Afghanistan … |
| 6 | 11/19/2001 | Recovery operation at World Trade Center |
| 7 | 11/3/2001 | 4343 died or missing as a result of the attacks on Sep 11 |
| 8 | 11/17/2001 | Dead and missing report of Sep 11 attack |
| … | … | … |

All top-ranked documents are related to the September 11 terrorist attack
![Page 15: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/15.jpg)
Top Correlated Words to American Airlines stock price
• All top correlated terms to the input time series are related to the terrorist attack
– Highly correlated terms contributed to retrieval of documents about this topic

| Word | \|ρ\| |
|---|---|
| challenged | 0.887031 |
| afghanistan | 0.861351 |
| security | 0.858745 |
| sept | 0.858309 |
| terrorism | 0.854865 |
| pakistan | 0.848829 |
| aghans | 0.844596 |
| afghan | 0.843481 |
| islamic | 0.842499 |
| taliban | 0.841455 |
![Page 16: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/16.jpg)
Top ranked ‘relevant’ documents for Apple stock price
| Rank | Date | Excerpt |
|---|---|---|
| 1 | 9/29/2000 | Fourth-quarter earnings far below estimates |
| 2 | 12/8/2000 | $4 billion reserve, not $11 billion |
| 3 | 10/19/2000 | Announced earnings report |
| 4 | 4/29/2001 | Dow and Nasdaq soar after rate cut by Federal Reserve |
| 5 | 7/20/2001 | Apple’s new retail stores |
| 6 | 12/6/2000 | Apple warns it will record quarterly loss |
| 7 | 3/24/2001 | Stocks perk up, with Nasdaq posting gain |
| 8 | 8/10/2000 | Mixing Mac and Windows |
| … | … | … |

• Retrieved relevant events: disappointing earnings report, store opening, etc.
• Useful as a new feature for re-ranking search results?
![Page 17: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/17.jpg)
Quantitative Evaluation
• All our methods > Random precision (0.0013)
• Dynamic time warping >> Pearson correlation
| Correlation function | MAP | NDCG |
|---|---|---|
| Pearson | 0.0019 | 0.3515 |
| DTW | 0.0022 | 0.3609 |

Average performance (average correlation as aggregation method)
![Page 18: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/18.jpg)
Comparison of Aggregation Methods
• AC << TopK, BM25
• Top5-AC << Top20-AC, but not more than K=20
• BM25 is sensitive to parameter setting
– Scores of AC methods are more meaningful
• Incomplete judgments
– Possibly much better performance in reality

| Method | MAP | NDCG |
|---|---|---|
| AC | 0.0019 | 0.3515 |
| Top5-AC | 0.0021 | 0.361 |
| Top10-AC | 0.0023 | 0.3618 |
| Top20-AC | 0.0024 | 0.3629 |
| Top5-AC-Uniq | 0.0022 | 0.3613 |
| Top10-AC-Uniq | 0.0022 | 0.3616 |
| Top20-AC-Uniq | 0.0022 | 0.3619 |
| Top5-BM25 | 0.0019 | 0.3584 |
| Top10-BM25 | 0.0023 | 0.361 |
| Top20-BM25 | 0.0019 | 0.3582 |

Average performance (w/ Pearson correlation)
![Page 19: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/19.jpg)
“Higher” NDCG vs. Low MAP
![Page 20: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/20.jpg)
Summary
• Introduced a novel retrieval problem: time series as query
• Studied basic solutions: time series representation of terms
– Term retrieval: correlation(query, term)
– Document retrieval: aggregation of term retrieval results
• Dynamic time warping + top-K average correlation seems to work well
![Page 21: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/21.jpg)
Limitations & Future Work
• Evaluation is based on simulation
– Highly incomplete judgments!
– What’s a good way to evaluate such a new retrieval task?
• Current solutions are heuristic
– How can we develop a more principled model?
• Different notions of relevance
– “Local” relevance vs. global relevance?
• All other issues relevant to a standard retrieval problem are worth exploring (e.g., feedback?)
![Page 22: Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062408/56649f305503460f94c4b42e/html5/thumbnails/22.jpg)
Thank You! Comments/Questions?