TI: AN EFFICIENT INDEXING MECHANISM FOR REAL-TIME SEARCH ON TWEETS SIGMOD ‘11 C. CHEN ET AL Pete Bohman Adam Kunk


Post on 13-Jan-2016


Page 1: Pete Bohman Adam Kunk. Real-Time Search  Definition: A search mechanism capable of finding information in an online fashion as it is produced. Technology

TI: AN EFFICIENT INDEXING MECHANISM FOR REAL-TIME SEARCH ON TWEETS

SIGMOD ’11

C. Chen et al.

Pete Bohman

Adam Kunk

Page 2:

Real-Time Search

Definition: A search mechanism capable of finding information in an online fashion as it is produced.

Technology belonging to the real-time web that enables users to receive information as soon as it is published.

Page 3:

Real-Time Search

In terms of real-time search, what does “online” mean?

Online means that a constant stream of input data is handled as it enters the system, in contrast to batch processing.

Bing Social Search
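The contrast between online and batch handling can be illustrated with a minimal sketch. The class and function names and the sample messages are made up for illustration and do not come from the TI paper.

```python
from collections import Counter


def batch_term_counts(messages):
    """Batch style: wait for the whole collection, then process it once."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts


class OnlineTermCounter:
    """Online style: update state as each message enters the system."""

    def __init__(self):
        self.counts = Counter()

    def handle(self, message):
        self.counts.update(message.lower().split())
        return self.counts


stream = ["quake in tokyo", "quake felt downtown"]  # made-up sample input

online = OnlineTermCounter()
for message in stream:
    online.handle(message)  # counts are queryable after every message

# Both styles reach the same final state; only the timing differs.
same_result = online.counts == batch_term_counts(stream)
```

The online version can answer a query after every message, while the batch version has nothing to report until the whole input has been collected.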

Page 4:

Real-Time Search Input Data

Example of the kind of input data considered by real-time search systems:

TwitterVision

Page 5:

Real-Time Content

Microblogging - Entirely new type of data

1. Short temporal life span

2. Little to no context

3. Simple ideas, fast reporting of events

4. Metadata: time, location, social links

5. Less factual, more opinionated

6. Static posts

7. Furious input rate

8. Often no hyperlink structure, few traditional ranking factors

Current search engines don’t take full advantage of this new data type

Page 6:

Real-Time vs. Conventional Search

Conventional Search Ranking

○ Relevance

○ Authority

Real-Time Search Ranking

○ Relevance

○ Temporal immediacy

○ Popularity

Page 7:

Real-Time vs. Conventional Search

Conventional search input

○ Crawl the web periodically and update the index

○ Web documents evolve

○ Incapable of crawling and indexing the entire web in real time

Real-time search input

○ Stream of data

○ No need to poll, since the posts are static

What can we do with real-time search engines?

Page 8:

User Query Analysis

Collecta real-time search engine

○ Analyzed ~1 million queries

Continuous queries

○ Monitor events by frequently resubmitting the same query

Different query categories

Conventional     Real-Time
Shopping         Commerce
Entertainment    Travel
Adult            Economy

Page 9:

Crowdsourcing Real-Time Data

Crowdsourcing of first-hand reports

Page 10:

Value of Real-Time Search

The estimated value of real-time search is around $33 million

○ Value derived from the types of queries entered in real-time search systems

○ AdWords was used to determine the worth of the keywords appearing in queries

Page 11:

Applications of Real-Time Search

TwitterStand: Real-time news reports

Example: Coverage of MJ’s death

Page 12:

Applications of Real-Time Search

Real-time alert systems

○ Leverage tweet metadata (time, location) to raise alerts

○ Earthquake localization based on tweets

Page 13:

Twitter Real-Time Alerts

USGS Twitter Earthquake Detector

Page 14:

Difficulties of Real-Time Search

Two factors:

○ Efficient indexing, in order to provide fast results

○ Effective ranking, in order to return relevant results

Page 15:

Indexing: RDBMS

RDBMS Indexing

○ Indexes built on columns commonly used in queries

○ Improves the speed of retrieval operations
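A small sketch of a column index, using Python's built-in sqlite3 module with an in-memory database. The table, column, and index names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO tweets (author, body) VALUES (?, ?)",
    [("alice", "hello"), ("bob", "lunch"), ("alice", "again")],
)

# Index the column that queries filter on most often (hypothetical workload).
conn.execute("CREATE INDEX idx_tweets_author ON tweets (author)")

rows = conn.execute(
    "SELECT body FROM tweets WHERE author = ? ORDER BY id", ("alice",)
).fetchall()

# EXPLAIN QUERY PLAN reports whether SQLite chose the index; the exact
# wording of the plan differs between SQLite versions.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT body FROM tweets WHERE author = ?", ("alice",)
).fetchall()
```

The query returns the same rows with or without the index; the index only changes how quickly the matching rows are located.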

Page 16:

Indexing: Conventional Search

Conventional Search (Inverted) Indexing

○ Unstructured data

○ If a document does not exist in the index, it will not appear in query results
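A minimal sketch of an inverted index, assuming whitespace tokenization and AND semantics for multi-term queries. Both are simplifying assumptions, and the sample documents are made up.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


indexed_docs = {
    1: "earthquake reported near the coast",
    2: "new phone released today",
}
index = build_inverted_index(indexed_docs)

# Document 3 exists but was never added to the index,
# so no query can return it, even though it mentions "earthquake".
not_indexed = {3: "earthquake felt downtown"}

hits = search(index, "earthquake")
```

Here `hits` contains only document 1, which shows why an unindexed document cannot appear in query results.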

Page 17:

Indexing: Real-Time Search

Index a stream of data

○ Map keywords to the tweets containing those keywords

Challenge

○ Processing the stream in a timely manner

○ 5,000 tweets per second
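To make the rate concrete: at 5,000 tweets per second, a single sequential indexer has about 0.2 ms per tweet. The sketch below shows an index that is updated as each tweet arrives; the class name, the whitespace tokenizer, and the sample tweets are illustrative assumptions.

```python
from collections import defaultdict

TWEETS_PER_SECOND = 5000                      # figure quoted on the slide
budget_ms = 1000 / TWEETS_PER_SECOND          # time available per tweet
tweets_per_day = TWEETS_PER_SECOND * 86_400   # if the rate were sustained


class StreamIndex:
    """Keyword -> tweet ids, appended in arrival order."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, tweet_id, text):
        # Deduplicate terms so a tweet appears once per posting list.
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)

    def newest(self, term, k):
        """Return up to k matching tweet ids, most recent first."""
        return self.postings.get(term.lower(), [])[-k:][::-1]


idx = StreamIndex()
idx.add(1, "quake quake downtown")
idx.add(2, "lunch time")
idx.add(3, "another quake")

latest = idx.newest("quake", 2)
```

Because posting lists grow in arrival order, the newest matches sit at the end of each list and can be read off without sorting.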

Page 18:

TI Indexing

Not feasible to index every incoming tweet immediately

Selective indexing based on the tweets that are most likely to appear in query results

○ Distinguished tweets are indexed in real time

○ Noisy tweets are indexed by a batch process
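The split between immediate and deferred indexing might look like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the `SelectiveIndexer` class, the `HOT_TERMS` set, and the `looks_distinguished` predicate are all hypothetical stand-ins for TI's actual classifier.

```python
from collections import defaultdict, deque


class SelectiveIndexer:
    """Index distinguished tweets at once; queue noisy ones for a batch job."""

    def __init__(self, is_distinguished):
        self.is_distinguished = is_distinguished  # caller-supplied predicate
        self.index = defaultdict(set)             # term -> tweet ids
        self.pending = deque()                    # noisy tweets awaiting batch

    def _index(self, tweet_id, text):
        for term in text.lower().split():
            self.index[term].add(tweet_id)

    def ingest(self, tweet_id, text):
        if self.is_distinguished(text):
            self._index(tweet_id, text)
        else:
            self.pending.append((tweet_id, text))

    def run_batch(self):
        """Index everything in the queue; return how many tweets were processed."""
        processed = 0
        while self.pending:
            self._index(*self.pending.popleft())
            processed += 1
        return processed


HOT_TERMS = {"earthquake", "election"}  # hypothetical stand-in for the classifier


def looks_distinguished(text):
    return any(term in HOT_TERMS for term in text.lower().split())


indexer = SelectiveIndexer(looks_distinguished)
indexer.ingest(1, "earthquake downtown")
indexer.ingest(2, "eating lunch")

searchable_before_batch = 2 in indexer.index.get("lunch", set())
batch_size = indexer.run_batch()
searchable_after_batch = 2 in indexer.index.get("lunch", set())
```

The noisy tweet is not lost; it only becomes searchable later, once the batch job has run.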

Page 19:

TI Tweet Classification

Observation

○ Users are only interested in the top-K results for a query

Distinguished tweet

○ A tweet that belongs in the top-K result set of a previous query

Noisy tweet

○ A tweet that does not appear in the top-K results for any of the system’s previous queries
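One way to picture the classification is to keep, for each previously seen query, the scores of its current top-K results and to test each new tweet against them. The sketch below assumes that a tweet matches a query when it contains all of the query's terms and that a relevance score is already available. Both are assumptions, and the names `QueryTopK` and `classify` are hypothetical.

```python
import heapq


class QueryTopK:
    """Scores of the current top-K results for one previously seen query."""

    def __init__(self, terms, k):
        self.terms = set(terms)
        self.k = k
        self.scores = []  # min-heap; scores[0] is the weakest top-K score

    def would_admit(self, score):
        return len(self.scores) < self.k or score > self.scores[0]

    def admit(self, score):
        if len(self.scores) < self.k:
            heapq.heappush(self.scores, score)
        else:
            heapq.heappushpop(self.scores, score)  # evict the weakest score


def classify(tweet_terms, score, cached_queries):
    """Return 'distinguished' if the tweet enters any cached top-K, else 'noisy'."""
    label = "noisy"
    for query in cached_queries:
        if query.terms <= tweet_terms and query.would_admit(score):
            query.admit(score)
            label = "distinguished"
    return label


quake_query = QueryTopK(["quake"], k=1)
labels = [
    classify({"quake", "now"}, 0.5, [quake_query]),  # top-K has room
    classify({"quake"}, 0.3, [quake_query]),         # too weak to enter
    classify({"quake"}, 0.9, [quake_query]),         # displaces the 0.5 entry
    classify({"phone"}, 1.0, [quake_query]),         # matches no cached query
]
```

A min-heap keeps the weakest top-K score at the front, so each check against a query costs one comparison.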

Page 20:

TI Indexing

Must limit the size of the query set

○ 1.6 billion Twitter queries per day

Page 21:

Query set optimization

Observation

○ 20% of queries represent 80% of user requests

Therefore

○ Zipf’s distribution is used to statistically limit the number of queries that tweets are compared against
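The slides do not give the exact selection procedure, so the sketch below uses a simple frequency cutoff as a stand-in: keep the most frequent queries until they cover a target share of all requests. The function name, the 0.8 default, and the sample log are assumptions.

```python
from collections import Counter


def select_query_set(query_log, coverage=0.8):
    """Return the most frequent queries that together cover `coverage` of the log."""
    counts = Counter(query_log)
    total = sum(counts.values())
    selected, covered = [], 0
    for query, n in counts.most_common():
        if covered >= coverage * total:
            break
        selected.append(query)
        covered += n
    return selected


# Made-up log: a few popular queries dominate, as in a Zipf-like distribution.
log = ["earthquake"] * 7 + ["election"] * 2 + ["lunch"]
query_set = select_query_set(log)
```

With this log, two of the three distinct queries account for 90% of the requests, so the rare query is left out of the comparison set.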

Page 22:

Real-Time Search Ranking

How does ranking differ from traditional web ranking?

○ Typical web search engines rank based on links to a site and links from a site (PageRank)

Microblogging data contains social networking links

○ Followers

○ Friends

○ Re-tweets

Page 23:

Real-Time Search Ranking

Ranking is not necessary in RDBMS systems

○ In an RDBMS, data is strictly defined, including algebraic operators

○ Results are complete, not subjective

Page 24:

TI Ranking

Ranking function comprised of:

1) User’s PageRank

○ Combination of user weight (defaulted to 1) and how many followers they have (popularity)

2) Timestamp (self-explanatory)

3) Similarity between the tweet and the query
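The slides list the components but not how they are combined, so the sketch below uses a plain weighted sum as an assumption. The weights, the exponential freshness decay with a one-hour half-life, and the cosine similarity over bags of words are all illustrative choices, not the paper's formula.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, tweet_terms):
    """Cosine similarity between two bags of words."""
    q, t = Counter(query_terms), Counter(tweet_terms)
    dot = sum(q[w] * t[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in t.values())
    )
    return dot / norm if norm else 0.0


def freshness(age_seconds, half_life=3600.0):
    """Assumed decay: the time component halves every `half_life` seconds."""
    return 0.5 ** (age_seconds / half_life)


def score(u_pagerank, age_seconds, similarity, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three components; the weights are placeholders."""
    w_user, w_time, w_sim = weights
    return w_user * u_pagerank + w_time * freshness(age_seconds) + w_sim * similarity


sim = cosine_similarity(["quake"], ["quake", "downtown"])
fresh_score = score(u_pagerank=0.2, age_seconds=60, similarity=sim)
stale_score = score(u_pagerank=0.2, age_seconds=86_400, similarity=sim)
```

With everything else equal, the newer tweet scores higher, which is the effect the timestamp component is meant to have.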

Page 25:

TI Ranking

Ranking function also comprised of:

4) Popularity of the topic

○ Determined by large tweet trees

○ Popularity of a tree is equal to the sum of the U-PageRank values of all tweets in the tree

Tweet Tree Structure
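The tree popularity rule can be sketched as a traversal that sums one value per tweet. The sketch assumes that a tweet's U-PageRank value is that of its author and that the tree is stored as a mapping from a tweet to its replies and retweets. The data layout and all names are hypothetical.

```python
def tree_popularity(root_id, children, author_of, u_pagerank):
    """Sum the U-PageRank value of every tweet in the tree rooted at root_id."""
    total = 0.0
    stack = [root_id]
    while stack:
        tweet_id = stack.pop()
        total += u_pagerank[author_of[tweet_id]]
        stack.extend(children.get(tweet_id, []))
    return total


# Made-up tree: tweet 1 is the root, 2 and 3 reply to it, 4 retweets 2.
children = {1: [2, 3], 2: [4]}
author_of = {1: "alice", 2: "bob", 3: "carol", 4: "alice"}
u_pagerank = {"alice": 0.5, "bob": 0.25, "carol": 0.25}

popularity = tree_popularity(1, children, author_of, u_pagerank)
```

An author who appears twice in the tree is counted twice, because the sum runs over tweets rather than over users.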

Page 26:

TI Ranking Comparison

TI Rank vs. Time Rank

Page 27:

What are others doing?

Page 28:

What are others doing?

Facebook Real-Time Feed

Page 29:

Implications

New type of data not currently searchable through existing search engines

New search tools developed for new data

New user search behavior

○ Continuous search results (non-static)

Advertisers

○ Chance for more targeted advertisements

Page 30:

Conclusion

TI makes use of two concepts in its real-time search of Twitter:

Selective Indexing

○ Form of partial indexing; can’t afford to index every incoming tweet due to the large volume of input

Ranking

○ Ranking is a known technique, but microblogging applications provide new ranking algorithms

Page 31:

Conclusion

Real-time search engines must provide:

○ Online algorithms to handle constant input

○ Relevant search results

Page 32:

References

TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets
http://www.comp.nus.edu.sg/~ooibc/sigmod11ti.pdf

Real Time Search User Behavior
http://faculty.ist.psu.edu/jjansen/academic/jansen_real_time_search.pdf

TwitterRank: Finding Topic-Sensitive Influential Twitterers
http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1503&context=sis_research

Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
http://ymatsuo.com/papers/www2010.pdf

TwitterStand: News in Tweets
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.148.1477&rep=rep1&type=pdf

Learning Effective Ranking Functions for Newsgroup Search
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.5556&rep=rep1&type=pdf

TwitterSearch: A Comparison of Microblog Search and Web Search
http://www.stanford.edu/~dramage/papers/twitter-wsdm11.pdf

TwitterVision
http://twittervision.com/

Bing Social
http://www.bing.com/social

Real time search on the web: Queries, topics, and economic value
http://collecta.com/RealTimeSearch.pdf

Page 33:

Discussion Questions

1)  What do you think is the most innovative technique in the TI approach that led to real-time microblog search results?

Page 34:

Discussion Questions

2) Given the partial indexing optimization provided in the paper, how do you think Google could optimize their indexing algorithm in order to capture the newest content on the web?

Page 35:

Discussion Questions

3) TI makes use of a ranking function in order to select tweets based on various user characteristics. What would you change about the ranking function, if anything?