1
Hashtag Retrieval in a Microblogging Environment
Miles Efron
Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign
[email protected]
http://people.lis.illinois.edu/~mefron
2
Microblog overview
Twitter is by far the largest microblog platform (>50M tweets / day)
• Brief posts
• Temporally ordered
• Follower / followed social model

Why would we search tweets?
• Find answered questions
• Gauge consensus or opinion
• Find ‘elite’ web resources
Twitter serves ~19B searches per month vs. 4B for Bing
http://searchengineland.com/twitter-does-19-billion-searches-per-month-39988
3
Anatomy of a Tweet
[Annotated tweet image, labeling: screen name, hashtag, mention, time stamp]
4
#hashtags
• Author-embedded metadata
• Hashtags collocate tweets:
  • topically
  • contextually (e.g. #local, #toRead)
  • with respect to affect (e.g. #iHateItWhen, #fail)
• This research primarily concerns retrieving topical hashtags:
  • Find tags to ‘follow’
  • Find a canonical tag for an entity (e.g. a conference)
5
Digression: Data Set
Item                         Count
Tweets                       2,308,297
Date range                   07/18/2010 – 08/28/2010
People                       9,297
Unique hashtags              77,185
Tweets containing >= 1 tag   362,087 (~16%)
Tweets containing > 1 tag    89,978 (~3.9%)
6
Hypotheses
General Hypothesis: Metadata in tweets can be marshaled to improve retrieval effectiveness.
Specific Hypothesis: Traditional measures of term importance (such as IDF) don’t translate to the problem of identifying useful hashtags. An alternative ‘social’ measure is more appropriate.
7
8
9
Microblog Entity Search
[Diagram: the query is scored against candidate entities entity1, entity2, …, entityn; cf. Balog et al. 2009.]
10
11
Language Modeling IR
Rank documents in decreasing order of their similarity to the query, where similarity is understood in a specific probabilistic context.

Assume that a document d was generated by some probability distribution M over the words in the indexing vocabulary.

What is the likelihood that the model that generated the text in d also generated the text in our query q? If the likelihood given the language model for document di is greater than the likelihood for another document dj, then we rank di higher than dj.
12
Language Modeling IR

d1: this year’s sigir was better than last year’s sigir
d2: was this (last year’s) sigir better than last

Document 1’s model:
Pr(better|d1)  Pr(last|d1)  Pr(sigir|d1)  Pr(than|d1)  Pr(this|d1)  Pr(was|d1)  Pr(year’s|d1)
1/9            1/9          2/9           1/9          1/9          1/9         2/9

Document 2’s model:
Pr(better|d2)  Pr(last|d2)  Pr(sigir|d2)  Pr(than|d2)  Pr(this|d2)  Pr(was|d2)  Pr(year’s|d2)
1/8            2/8          1/8           1/8          1/8          1/8         1/8
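The two tables above are just maximum-likelihood unigram estimates, Pr(w|d) = n(w,d) / |d|. A minimal sketch in Python (the whitespace tokenization is illustrative):

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model: Pr(w|d) = n(w, d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

d1 = "this year's sigir was better than last year's sigir".split()
d2 = "was this last year's sigir better than last".split()

m1 = unigram_model(d1)  # e.g. Pr(sigir|d1) = 2/9, Pr(this|d1) = 1/9
m2 = unigram_model(d2)  # e.g. Pr(last|d2) = 2/8, every other term 1/8
```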
13
Language Modeling IR
q: this sigir

d1: this year’s sigir was better than last year’s sigir
d2: was this (last year’s) sigir better than last

Document 1’s model:
Pr(better|d1)  Pr(last|d1)  Pr(sigir|d1)  Pr(than|d1)  Pr(this|d1)  Pr(was|d1)  Pr(year’s|d1)
1/9            1/9          2/9           1/9          1/9          1/9         2/9

Document 2’s model:
Pr(better|d2)  Pr(last|d2)  Pr(sigir|d2)  Pr(than|d2)  Pr(this|d2)  Pr(was|d2)  Pr(year’s|d2)
1/8            2/8          1/8           1/8          1/8          1/8         1/8
16
Language Modeling IR
Rank documents by the likelihood that their language models generated the query.
17
Language Modeling IR
Rank documents by the likelihood that their language models generated the query.
18
Language Modeling IR
Rank documents by the likelihood that their language models generated the query.
The document prior Pr(d) is often assumed to be uniform (the same for all documents) and thus dropped.
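The ranking equation on these slides did not survive the transcript. Reconstructed from the surrounding description (a unigram model M_d per document, plus a document prior that is often assumed uniform and dropped), the standard query-likelihood rule is:

```latex
P(d \mid q) \;\propto\; P(q \mid d)\,P(d) \;=\; P(d)\,\prod_{w \in q} P(w \mid M_d)
```

With a uniform prior, documents are ranked by the product of query-term probabilities alone.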
19
Language Modeling IR
q: this sigir

Document 1’s model:
Pr(better|d1)  Pr(last|d1)  Pr(sigir|d1)  Pr(than|d1)  Pr(this|d1)  Pr(was|d1)  Pr(year’s|d1)
1/9            1/9          2/9           1/9          1/9          1/9         2/9

Document 2’s model:
Pr(better|d2)  Pr(last|d2)  Pr(sigir|d2)  Pr(than|d2)  Pr(this|d2)  Pr(was|d2)  Pr(year’s|d2)
1/8            2/8          1/8           1/8          1/8          1/8         1/8
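Scoring q = “this sigir” against the two models multiplies the per-term probabilities (a sketch; real systems smooth these estimates so unseen terms don’t zero the score):

```python
def query_likelihood(query_tokens, model):
    """Pr(q|d) under a unigram model; an unseen term zeroes the score."""
    score = 1.0
    for w in query_tokens:
        score *= model.get(w, 0.0)
    return score

# Per-term probabilities taken from the tables above.
m1 = {"this": 1/9, "sigir": 2/9}
m2 = {"this": 1/8, "sigir": 1/8}
q = "this sigir".split()

s1 = query_likelihood(q, m1)  # 1/9 * 2/9 = 2/81
s2 = query_likelihood(q, m2)  # 1/8 * 1/8 = 1/64
# s1 > s2, so d1 is ranked above d2.
```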
20
What is a Document?
• For a tweet, n(wi, d), the count of word wi in document d, is self-explanatory.
• For other entities (hashtags and people), we use the “virtual document” approach (Macdonald, 2009).
21
Virtual Documents
• For a hashtag hi, define a virtual document di that consists of the concatenated text of all tweets containing hi.

Documents:
  Lorem #ipsum dolor sit amet
  In blandit ipsum purus vitae
  #Ipsum #sapien mollis dui

Virtual Documents:
  #ipsum: Lorem #ipsum dolor sit amet + #Ipsum #sapien mollis dui
  #sapien: #Ipsum #sapien mollis dui
22
Virtual Documents
• For a hashtag hi, define a virtual document di that consists of the concatenated text of all tweets containing hi.

  #ipsum: Lorem #ipsum dolor sit amet + #Ipsum #sapien mollis dui
  #sapien: #Ipsum #sapien mollis dui
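Building virtual documents is a single pass over the collection. A sketch in Python (the `#\w+` hashtag pattern and case-folding are simplifying assumptions):

```python
import re

def virtual_documents(tweets):
    """Map each hashtag (lowercased) to the concatenated text of
    every tweet that contains it."""
    vdocs = {}
    for tweet in tweets:
        for tag in set(t.lower() for t in re.findall(r"#\w+", tweet)):
            vdocs.setdefault(tag, []).append(tweet)
    return {tag: " ".join(texts) for tag, texts in vdocs.items()}

tweets = [
    "Lorem #ipsum dolor sit amet",
    "In blandit ipsum purus vitae",   # untagged: joins no virtual document
    "#Ipsum #sapien mollis dui",
]
vdocs = virtual_documents(tweets)
# vdocs["#ipsum"] concatenates tweets 1 and 3; vdocs["#sapien"] is tweet 3.
```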
23
Does it work? Results for SIGIR:
#sigir #sigir2010 #sigir10
24
Does it work? Results for SIGIR:
#sigir #sigir2010 #sigir10 #gingdotblog #recsys #msrsocpapers #kdd2010
25
Does it work? Results for SIGIR:
#sigir #sigir2010 #sigir10 #gingdotblog #recsys #msrsocpapers #kdd2010 #tripdavisor #kannada #ecdl2010 #genevaishotandhasnoairconditioning #sigir20010 #wsdm2011
26
Hashtag Priors
• Some tags are better than others.
• Even if its language model is on-topic, a very common tag (e.g. #mutread) is probably not useful.
• But rarity isn’t much help either: #genevaishotandhasnoairconditioning
• Workhorse measures like IDF don’t get at tag usefulness.
• But “document” (i.e. tag) priors offer help.
27
Hashtag Priors
• Intuition: a tag is likely to be useful if it is used in many useful tweets.
• A tweet is useful if it contains many useful tags (obviously this is an oversimplification).
28
Hashtag Priors—Analogy to PageRank
Where:
• h is a hashtag.
• H is the set of tags that co-occur with h.
• t is a hashtag in the set H.
• α is a constant so that the probabilities sum to one.
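The equation itself is missing from the transcript; from the legend above and the iterative calculation on slide 31, a plausible reconstruction is:

```latex
\Pr(h) \;=\; \alpha \sum_{t \in H} \Pr(t)
```

That is, a tag’s prior is the (normalized) sum of the priors of the tags it co-occurs with.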
29
Hashtag Priors—Analogy to PageRank
• These prior probabilities are the steady state of the Markov chain…
• A “random reader” model:
  • Reading tweets
  • Choosing at random what to do next:
    • Examine tweets with a hashtag in the current tweet
    • Go to a random, new tweet (so we need…)
30
31
Hashtag Priors—Calculation
1. Initialize all n(T) tags to a constant probability.
2. For each tag h:
   a. Find the set of tags H that co-occur with h.
   b. Set Pr(h) = the sum of Pr(·) over all tags in H.
   c. Normalize all scores.
3. Iterate, repeating step 2 until convergence.
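The steps above can be sketched in Python. The damping term mirrors the “random reader” jump to a random new tweet; the 0.85 value, the hashtag regex, and the co-occurrence extraction are illustrative assumptions, not details from the paper:

```python
import re
from collections import defaultdict
from itertools import combinations

def hashtag_priors(tweets, damping=0.85, iterations=50):
    """PageRank-style 'social' prior: each tag's score is the
    (damped, normalized) sum of the scores of its co-occurring tags."""
    cooccur = defaultdict(set)
    tags = set()
    for tweet in tweets:
        found = {t.lower() for t in re.findall(r"#\w+", tweet)}
        tags |= found
        for a, b in combinations(found, 2):
            cooccur[a].add(b)
            cooccur[b].add(a)
    n = len(tags)
    prior = {t: 1.0 / n for t in tags}           # step 1: constant init
    for _ in range(iterations):
        # step 2: Pr(h) = sum of Pr(.) over co-occurring tags, normalized
        raw = {h: sum(prior[t] for t in cooccur[h]) for h in tags}
        z = sum(raw.values()) or 1.0
        prior = {h: (1 - damping) / n + damping * raw[h] / z for h in tags}
    return prior
```

A tag that never co-occurs with others keeps only the uniform jump mass, while well-connected tags accumulate score from their neighbors.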
32
A Return to Intuition
• Assume that if two tags co-occur in a tweet, they share an affinity (i.e. they are linked).
• Assume that tags that occur in many tweets are highly engaged in the discourse on Twitter.
• Highly engaged tags spread their influence to tags that may be less popular but are linked to engaged tags.
33
Properties of Hashtag “social” Priors
Cor(freq, prior) = 0.275
34
Properties of Hashtag “social” Priors
+-----------------+---------+---------------------+
| tag_text        | docFreq | score               |
+-----------------+---------+---------------------+
| #linkeddata     |     559 | 0.00537524473400945 |
| #opensource     |    1054 | 0.00427572303218856 |
| #semanticweb    |     215 | 0.00406174857132168 |
| #yam            |     345 | 0.00269713986530859 |
| #rdf            |     106 | 0.00257291441134344 |
| #hadoop         |     304 | 0.00247387571898314 |
| #e20            |     512 | 0.00235774256615437 |
| #opendata       |     343 | 0.00234389638939712 |
| #opengov        |     563 | 0.00230838488530255 |
| #nosql          |     414 | 0.00220599138375711 |
| #gov20          |    1964 | 0.00218390304201311 |
| #semweb         |     116 | 0.00209311764669248 |
| #cio            |     199 | 0.00190685555120058 |
| #a11y           |     462 | 0.00184103588077775 |
| #sparql         |      61 | 0.001802252610603   |
| #webid          |     125 | 0.00170313580607837 |
| #semantic       |      89 | 0.001699091839444   |
| #cloudcomputing |     123 | 0.00166678332288627 |
| #rdfa           |     117 | 0.00165629661283845 |
| #oss            |      75 | 0.00164741425796336 |
+-----------------+---------+---------------------+
35
An example: “immigration reform”
No Priors: #immigration #aussiemigration #teachers #physicians #parenting #election #twisters #reform #politics #hcreform

With Priors: #immigration #politics #twisters #economist #tlot #tcot #healthcare #ocra #sgp #teaparty
36
Assessment
• 25 test queries, examined via 2 Amazon Mechanical Turk tasks.
• Queries were created manually.
• For each task, each query was completed by 5 people; usefulness estimates were obtained by averaging the 5 scores.
Research question: Does incorporating social priors into hashtag retrieval improve the usefulness of results?
37
Task 1: assess an individual query/model pair (10 results)
Assess:
1. Overall usefulness
2. Clarity of results
3. Obviousness of results
(Additional demographic info collected.)
38
Task 2: compare the quality of two rankings (10 results each)
Assess:
1. Overall usefulness
2. Clarity of results
3. Obviousness of results
(Additional demographic info collected.)
39
Task 1: Single Model Assessment
             Overall   Clarity   Too obvious
No priors    1.808     1.883     2.512
Priors       2.200     2.225     2.817
% improved   21.618    18.163    (12.141)
p-value      0.008     0.030     0.206
40
Task 2: Comparing Models

             Overall   Clarity   Too obvious
Score        1.6       1.28      -1.36
p-value      0.094     0.134     0.134

(Scale: 2 = priors preferred, -2 = no priors preferred)
41
What Does Hashtag Retrieval Let Us Do?
• Ad hoc tag retrieval
• Query Expansion (Efron, 2010)
• Document Expansion
42
Ad hoc Retrieval
43
Query Expansion: immigration reform
Relevance model:
#weight( 0.5 #combine( immigration reform )
         0.5 #weight( 0.9652168 immigration 0.8424631 reform 0.1551001 rt
                      0.1448956 t 0.1413850 obama 0.1361353 law
                      0.1344551 aussiemigration 0.1299880 s
                      0.1008342 australia 0.0939461 illegal ) )

Hashtag expansion:
#weight( 0.5 #combine( immigration reform )
         0.5 #weight( 4.48 immigration 1.357 politics 0.965 twisters
                      0.927 economist 0.847 tlot ) )
Efron (2010): 8.2% improvement over baseline, 6.92% over term-based feedback.
44
Document Expansion
• Browsers For Visually Impaired Users
• Key Elements of a Startup
45
Document Expansion
• Browsers For Visually Impaired Users: #a11y #accessibility #assistive #axs #touch
• Key Elements of a Startup: #startup #newtech #meetup #meetups #prodmktg #lean
46
Next Steps
• Articulate and investigate two senses of “search” on Twitter:
  • Searching over collected, indexed tweets.
  • Social search, e.g.: “Curious to hear from anyone who has gotten to play with @blekko. The user-controlled sorting (what they call ‘slashtags’) is intriguing.”
• Consider document surrogates for retrieval sets.
• Information synthesis from retrieved data: “spontaneous documents.”
47
Thank You!
48
References
Balog, K., Azzopardi, L., & de Rijke, M. (2009). A language modeling framework for expert finding. Information Processing & Management, 45(1), 1–19.
Efron, M. (2010). Hashtag retrieval in a microblogging environment. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (pp. 787-788). Geneva, Switzerland: ACM.
Macdonald, C. (2009). The Voting Model for People Search (Doctoral Dissertation). University of Glasgow.