Hashtag Retrieval in a Microblogging Environment


1

Hashtag Retrieval in a Microblogging Environment

Miles Efron
Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign
mefron@illinois.edu
http://people.lis.illinois.edu/~mefron

2

Microblog overview

Twitter is by far the largest microblog platform (>50M tweets/day):
• Brief posts
• Temporally ordered
• Follower/followed social model

Why would we search tweets?
• Find answered questions
• Gauge consensus or opinion
• Find ‘elite’ web resources

Twitter handles ~19B searches per month, vs. 4B for Bing.
http://searchengineland.com/twitter-does-19-billion-searches-per-month-39988

3

Anatomy of a Tweet

(Figure: an example tweet with its screen name, hashtag, mention, and time stamp labeled.)

4

#hashtags

• Author-embedded metadata
• Hashtags collocate tweets:
  • topically
  • contextually (e.g. #local, #toRead)
  • with respect to affect (e.g. #iHateItWhen, #fail)
• This research primarily concerns retrieving topical hashtags:
  • Find tags to ‘follow’
  • Find a canonical tag for an entity (e.g. a conference)

5

Digression: Data Set

Item                         Count
Tweets                       2,308,297
Date range                   07/18/2010 – 08/28/2010
People                       9,297
Unique hashtags              77,185
Tweets containing >= 1 tag   362,087 (~16%)
Tweets containing > 1 tag    89,978 (~3.9%)

6

Hypotheses

General Hypothesis: Metadata in tweets can be marshaled to improve retrieval effectiveness.

Specific Hypothesis: Traditional measures of term importance (such as IDF) don’t translate to the problem of identifying useful hashtags. An alternative ‘social’ measure is more appropriate.


9

Microblog Entity Search

(Figure: a query is run against the collection and returns a ranked list of entities: entity 1, entity 2, …, entity n.)

cf. Balog et al. 2006.


11

Language Modeling IR

Rank documents in decreasing order of their similarity to the query, where similarity is understood in a specific probabilistic context.

Assume that a document d was generated by some probability distribution M over the words in the indexing vocabulary.

What is the likelihood that the model that generated the text in d also generated the text in our query q? If the likelihood given the language model for document di is greater than the likelihood for another document dj, then we rank di higher than dj.

12

Language Modeling IR

d1 : this year’s sigir was better than last year’s sigir

d2 : was this (last year’s) sigir better than last

Document 1’s modelPr(better|d1) Pr(last|d1) Pr(sigir|d1) Pr(than|d1) Pr(this|d1) Pr(was|d1) Pr(year’s|d1)

1/9 1/9 2/9 1/9 1/9 1/9 2/9

Document 2’s model

Pr(better | d1) Pr(last | d1) Pr(sigir| d1) Pr(than |d1) Pr(this| d1) Pr(was| d1) Pr(year’s|d1)

1/8 2/8 1/8 1/8 1/8 1/8 1/8

13

Language Modeling IR

q: this sigir

(d1, d2, and their models as on the previous slide)

16

Language Modeling IR

Rank documents by the likelihood that their language models generated the query:

Pr(d|q) ∝ Pr(q|d) · Pr(d), with Pr(q|d) = ∏_{w∈q} Pr(w|d)^{n(w,q)} under term independence.


The document prior Pr(d) is often assumed to be uniform (the same for all documents) and thus dropped.

19

Language Modeling IR

q: this sigir

Score for d1: Pr(this|d1) · Pr(sigir|d1) = (1/9)(2/9) = 2/81 ≈ 0.025
Score for d2: Pr(this|d2) · Pr(sigir|d2) = (1/8)(1/8) = 1/64 ≈ 0.016

d1 is therefore ranked above d2.
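A minimal runnable sketch (Python, added here for illustration; not part of the original deck) that reproduces this toy computation, using the unsmoothed maximum-likelihood models shown on the slides:

```python
from collections import Counter

docs = {
    "d1": "this year's sigir was better than last year's sigir",
    "d2": "was this ( last year's ) sigir better than last",
}
query = "this sigir"

def mle_model(text):
    # Maximum-likelihood unigram model: Pr(w|d) = n(w, d) / |d|.
    tokens = [t for t in text.split() if t not in {"(", ")"}]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def query_likelihood(q, model):
    # Pr(q|d) under term independence; zero if a query term is unseen
    # (real systems smooth the model to avoid such zeros).
    p = 1.0
    for w in q.split():
        p *= model.get(w, 0.0)
    return p

models = {name: mle_model(text) for name, text in docs.items()}
for name in sorted(docs, key=lambda n: -query_likelihood(query, models[n])):
    print(name, round(query_likelihood(query, models[name]), 4))
# d1 0.0247  (= 1/9 * 2/9)
# d2 0.0156  (= 1/8 * 1/8)
```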

20

What is a Document?

• For a tweet, n(wi, d) is self-explanatory.
• For other entities (hashtags and people), we use the “virtual document” approach (Macdonald, 2009).

21

Virtual Documents

• For a hashtag hi, define a virtual document di that consists of the concatenated text of all tweets containing hi.

Documents:
  Lorem #ipsum dolor sit amet
  In blandit ipsum purus vitae
  #Ipsum #sapien mollis dui

Virtual documents:
  #ipsum: Lorem #ipsum dolor sit amet #Ipsum #sapien mollis dui
  #sapien: #Ipsum #sapien mollis dui
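A minimal sketch of this construction (illustrative Python, not from the deck), assuming tags match case-insensitively, as the #ipsum example suggests:

```python
import re
from collections import defaultdict

tweets = [
    "Lorem #ipsum dolor sit amet",
    "In blandit ipsum purus vitae",  # plain 'ipsum' is not a hashtag use
    "#Ipsum #sapien mollis dui",
]

# The virtual document for a hashtag is the concatenated text of
# every tweet that contains that hashtag.
virtual_docs = defaultdict(list)
for tweet in tweets:
    for tag in set(re.findall(r"#\w+", tweet)):
        virtual_docs[tag.lower()].append(tweet)

for tag, texts in sorted(virtual_docs.items()):
    print(tag, "->", " ".join(texts))
# #ipsum -> Lorem #ipsum dolor sit amet #Ipsum #sapien mollis dui
# #sapien -> #Ipsum #sapien mollis dui
```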



25

Does it work?

Results for query “SIGIR”:
#sigir, #sigir2010, #sigir10, #gingdotblog, #recsys, #msrsocpapers, #kdd2010, #tripdavisor, #kannada, #ecdl2010, #genevaishotandhasnoairconditioning, #sigir20010, #wsdm2011

26

Hashtag Priors

• Some tags are better than others.
• Even if its language model is on-topic, a very common tag (e.g. #mustread) is probably not useful.
• But rarity isn’t much help either: #genevaishotandhasnoairconditioning.
• Workhorse measures like IDF don’t get at tag usefulness.
• But “document” (i.e. tag) priors offer help.

27

Hashtag Priors

• Intuition: a tag is likely to be useful if it is used in many useful tweets.

• A tweet is useful if it contains many useful tags (obviously this is an oversimplification).

28

Hashtag Priors—Analogy to PageRank

Pr(h) = α · Σ_{t ∈ H} Pr(t)

where:
• h is a hashtag.
• H is the set of tags that co-occur with h.
• t is a hashtag in the set H.
• α is a constant so that the probabilities sum to one.

29

Hashtag Priors—Analogy to PageRank

• These prior probabilities are the steady state of the Markov chain…
• A “random reader” model:
  • Reading tweets
  • Choosing at random what to do next:
    o Examine tweets with a hashtag in the current tweet
    o Go to a random, new tweet (so we need…)


31

Hashtag Priors—Calculation

1. Initialize all n(T) tags to constant probability.
2. For each tag h:
   a. Find the set of tags H that co-occur with h.
   b. Set Pr(h) = sum of Pr(·) for all tags in H.
   c. Normalize all scores.
3. Iterate, repeating step 2 until convergence.
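A minimal sketch of this iteration (illustrative Python; names like tag_cooc are assumptions, not from the deck). Note the deck’s “go to a random, new tweet” step would add a damping term, which is omitted here:

```python
def hashtag_priors(tag_cooc, max_iters=100, tol=1e-10):
    # tag_cooc: dict mapping each tag to the set of tags it co-occurs with.
    tags = list(tag_cooc)
    prior = {t: 1.0 / len(tags) for t in tags}       # step 1: uniform start
    for _ in range(max_iters):
        # step 2a/2b: a tag's new score is the sum of the priors
        # of the tags it co-occurs with
        new = {h: sum(prior[t] for t in tag_cooc[h]) for h in tags}
        norm = sum(new.values())                     # step 2c: normalize
        new = {h: s / norm for h, s in new.items()}  # (this is the alpha)
        if max(abs(new[t] - prior[t]) for t in tags) < tol:
            return new                               # step 3: converged
        prior = new
    return prior

# toy example: #a co-occurs with #b, #c, and #d; #b and #c also co-occur
cooc = {"#a": {"#b", "#c", "#d"}, "#b": {"#a", "#c"},
        "#c": {"#a", "#b"}, "#d": {"#a"}}
print(hashtag_priors(cooc))
# -> #a, which co-occurs with the most tags, gets the largest prior
```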

32

A Return to Intuition

• Assume that if two tags co-occur in a tweet, they share an affinity (i.e. they are linked).
• Assume that tags that occur in many tweets are highly engaged in the discourse on Twitter.
• Highly engaged tags spread their influence to tags that may be less popular but are linked to them.

33

Properties of Hashtag “social” Priors

Correlation between a tag’s frequency and its prior: Cor(freq, prior) = 0.275.

34

Properties of Hashtag “social” Priors

tag_text           docFreq   score
#linkeddata        559       0.00537524473400945
#opensource        1054      0.00427572303218856
#semanticweb       215       0.00406174857132168
#yam               345       0.00269713986530859
#rdf               106       0.00257291441134344
#hadoop            304       0.00247387571898314
#e20               512       0.00235774256615437
#opendata          343       0.00234389638939712
#opengov           563       0.00230838488530255
#nosql             414       0.00220599138375711
#gov20             1964      0.00218390304201311
#semweb            116       0.00209311764669248
#cio               199       0.00190685555120058
#a11y              462       0.00184103588077775
#sparql            61        0.001802252610603
#webid             125       0.00170313580607837
#semantic          89        0.001699091839444
#cloudcomputing    123       0.00166678332288627
#rdfa              117       0.00165629661283845
#oss               75        0.00164741425796336

35

An example: “immigration reform”

No priors: #immigration, #aussiemigration, #teachers, #physicians, #parenting, #election, #twisters, #reform, #politics, #hcreform

With priors: #immigration, #politics, #twisters, #economist, #tlot, #tcot, #healthcare, #ocra, #sgp, #teaparty
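How the priors enter the ranking: per the earlier formula Pr(d|q) ∝ Pr(q|d) · Pr(d), the prior is no longer dropped as uniform but multiplied into each tag’s query likelihood. A minimal sketch under that assumption (illustrative names; eps stands in for proper smoothing):

```python
import math

def score_tag(query, tag_model, tag_prior, eps=1e-9):
    # log Pr(q|d_h) + log Pr(h): query likelihood under the tag's
    # virtual-document model, plus the tag's social prior, in log space.
    log_lik = sum(math.log(tag_model.get(w, eps)) for w in query.split())
    return log_lik + math.log(tag_prior)
```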

36

Assessment

• 25 test queries, assessed via 2 Amazon Mechanical Turk tasks.
• Queries were created manually.
• For each task, each query was completed by 5 people; usefulness estimates were obtained by averaging the 5 scores.

Research question: Does incorporating social priors into hashtag retrieval improve the usefulness of results?

37

Task 1: assess an individual query/model pair (10 results)

Assess:
1. Overall usefulness
2. Clarity of results
3. Obviousness of results

(additional demographic info collected)

38

Task 2: compare the quality of two rankings (10 results each)

Assess:
1. Overall usefulness
2. Clarity of results
3. Obviousness of results

(additional demographic info collected)

39

Task 1: Single Model Assessment

              Overall    Clarity    Too obvious
No priors     1.808      1.883      2.512
Priors        2.200      2.225      2.817
% improved    21.618     18.163     (12.141)
p-value       0.008      0.030      0.206

40

Task 2: Comparing Models

Preference judgments on a scale from 2 (priors) to -2 (no priors):

              Overall    Clarity    Too obvious
Score         1.6        1.28       -1.36
p-value       0.094      0.134      0.134

41

What Does Hashtag Retrieval Let Us Do?

• Ad hoc tag retrieval
• Query expansion (Efron, 2010)
• Document expansion

42

Ad hoc Retrieval

43

Query Expansion: immigration reform

Relevance model:
#weight( 0.5 #combine( immigration reform )
         0.5 #weight( 0.9652168 immigration 0.8424631 reform 0.1551001 rt
                      0.1448956 t 0.1413850 obama 0.1361353 law
                      0.1344551 aussiemigration 0.1299880 s
                      0.1008342 australia 0.0939461 illegal ) )

Hashtag expansion:
#weight( 0.5 #combine( immigration reform )
         0.5 #weight( 4.48 immigration 1.357 politics 0.965 twisters
                      0.927 economist 0.847 tlot ) )

Efron (2010): 8.2% improvement over baseline, 6.92% over term-based feedback.
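A minimal sketch of how such an expanded query string can be assembled (illustrative Python; the #weight/#combine operators are Indri query-language syntax, as in the examples above):

```python
def expansion_query(query_terms, weighted_tags, w_orig=0.5):
    # Interpolate the original query with score-weighted expansion terms.
    original = f"#combine( {' '.join(query_terms)} )"
    expansion = " ".join(f"{score} {term}" for term, score in weighted_tags)
    return f"#weight( {w_orig} {original} {1.0 - w_orig} #weight( {expansion} ) )"

tags = [("immigration", 4.48), ("politics", 1.357), ("twisters", 0.965),
        ("economist", 0.927), ("tlot", 0.847)]
print(expansion_query(["immigration", "reform"], tags))
# -> #weight( 0.5 #combine( immigration reform )
#            0.5 #weight( 4.48 immigration 1.357 politics ... ) )
```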


45

Document Expansion

Browsers For Visually Impaired Users: #a11y #accessibility #assistive #axs #touch

Key Elements of a Startup: #startup #newtech #meetup #meetups #prodmktg #lean

46

Next Steps

• Articulate and investigate two senses of “search” on Twitter:
  • Searching over collected, indexed tweets.
  • Social search, e.g.: “Curious to hear from anyone who has gotten to play with @blekko. The user-controlled sorting (what they call ‘slashtags’) is intriguing.”
• Consider document surrogates for retrieval sets.
• Information synthesis from retrieved data: “spontaneous documents.”

47

Thank You!

48

References

Balog, K., Azzopardi, L., & de Rijke, M. (2009). A language modeling framework for expert finding. Information Processing & Management, 45(1), 1-19.

Efron, M. (2010). Hashtag retrieval in a microblogging environment. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 787-788). Geneva, Switzerland: ACM.

Macdonald, C. (2009). The Voting Model for People Search (Doctoral dissertation). University of Glasgow.
