
Hashtag Retrieval in a Microblogging Environment

Miles Efron
Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign
[email protected]
http://people.lis.illinois.edu/~mefron


Microblog overview

Twitter is by far the largest microblog platform (>50M tweets / day)
• Brief posts
• Temporally ordered
• Follower / followed social model

Why would we search tweets?
• Find answered questions
• Gauge consensus or opinion
• Find ‘elite’ web resources

Twitter: ~19B searches per month vs. 4B for Bing
http://searchengineland.com/twitter-does-19-billion-searches-per-month-39988


Anatomy of a Tweet

[Figure: an annotated tweet with its parts labeled: screen name, hashtag, mention, time stamp]


#hashtags

• Author-embedded metadata
• Hashtags collocate tweets
  • topically
  • contextually (e.g. #local, #toRead)
  • with respect to affect (e.g. #iHateItWhen, #fail)
• This research primarily concerns retrieving topical hashtags.
  • Find tags to ‘follow’
  • Find a canonical tag for an entity (e.g. a conference)


Digression: Data Set

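+----------------------------+-------------------------+
| Item                       | Count                   |
+----------------------------+-------------------------+
| Tweets                     | 2,308,297               |
| Date range                 | 07/18/2010 – 08/28/2010 |
| People                     | 9,297                   |
| Unique hashtags            | 77,185                  |
| Tweets containing >= 1 tag | 362,087 (~16%)          |
| Tweets containing > 1 tag  | 89,978 (~3.9%)          |
+----------------------------+-------------------------+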
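Counts like these can be derived from raw tweet text with a few lines of Python. The sketch below is illustrative only: the function name `hashtag_stats`, the `#\w+` pattern, the case-folding of tags, and the choice to count distinct tags per tweet are all assumptions, not details given in the talk.

```python
import re

# Assumed pattern: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_stats(tweets):
    """Count unique tags and tweets containing at least one / more than one tag."""
    unique_tags = set()
    with_tag = 0
    with_multiple = 0
    for tweet in tweets:
        # Case-fold so that #Ipsum and #ipsum are treated as one tag (assumption).
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        unique_tags |= tags
        if len(tags) >= 1:
            with_tag += 1
        if len(tags) > 1:
            with_multiple += 1
    return {
        "tweets": len(tweets),
        "unique_hashtags": len(unique_tags),
        "tweets_with_tag": with_tag,
        "tweets_with_multiple_tags": with_multiple,
    }


# Toy input, not the data set described above.
sample = [
    "Lorem #ipsum dolor sit amet",
    "In blandit ipsum purus vitae",
    "#Ipsum #sapien mollis dui",
]
stats = hashtag_stats(sample)
```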


Hypotheses

General Hypothesis: Metadata in tweets can be marshaled to improve retrieval effectiveness.

Specific Hypothesis: Traditional measures of term importance (such as IDF) don’t translate to the problem of identifying useful hashtags. An alternative ‘social’ measure is more appropriate.


Microblog Entity Search

[Figure: a query linked to entities entity1, entity2, …, entityn]

cf. Balog et al. 2006.


Language Modeling IR

Rank documents in decreasing order of their similarity to the query, where similarity is understood in a specific probabilistic context.

Assume that a document d was generated by some probability distribution M over the words in the indexing vocabulary.

What is the likelihood that the model that generated the text in d also generated the text in our query q? If the likelihood given the language model for document d_i is greater than the likelihood for another document d_j, then we rank d_i higher than d_j.


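Language Modeling IR

q: this sigir

d1: this year’s sigir was better than last year’s sigir

d2: was this (last year’s) sigir better than last

Document 1’s model, Pr(w|d1)

+--------+------+-------+------+------+-----+--------+
| better | last | sigir | than | this | was | year’s |
+--------+------+-------+------+------+-----+--------+
| 1/9    | 1/9  | 2/9   | 1/9  | 1/9  | 1/9 | 2/9    |
+--------+------+-------+------+------+-----+--------+

Document 2’s model, Pr(w|d2)

+--------+------+-------+------+------+-----+--------+
| better | last | sigir | than | this | was | year’s |
+--------+------+-------+------+------+-----+--------+
| 1/8    | 2/8  | 1/8   | 1/8  | 1/8  | 1/8 | 1/8    |
+--------+------+-------+------+------+-----+--------+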
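The two document models and the query q can be worked through in a few lines of Python. This is a minimal sketch that assumes unsmoothed maximum-likelihood estimates and whitespace tokenization; the helper names (`tokenize`, `language_model`, `query_likelihood`) are illustrative and not from the talk.

```python
from collections import Counter
from fractions import Fraction


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding parentheses."""
    tokens = (tok.strip("()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def language_model(text):
    """Maximum-likelihood unigram model: Pr(w|d) = n(w, d) / |d|."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {word: Fraction(n, total) for word, n in counts.items()}


def query_likelihood(query, model):
    """Pr(q|d) as the product of Pr(w|d) over the query words."""
    likelihood = Fraction(1)
    for word in tokenize(query):
        # No smoothing here (an assumption): an unseen word gives zero.
        likelihood *= model.get(word, Fraction(0))
    return likelihood


d1 = "this year's sigir was better than last year's sigir"
d2 = "was this (last year's) sigir better than last"
q = "this sigir"

scores = {
    name: query_likelihood(q, language_model(doc))
    for name, doc in (("d1", d1), ("d2", d2))
}
# Rank documents by decreasing query likelihood.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here d1 ranks above d2, because 1/9 × 2/9 = 2/81 exceeds 1/8 × 1/8 = 1/64.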


Language Modeling IR

Rank documents by the likelihood that their language models generated the query.

The document prior Pr(d) is often assumed to be uniform (the same for all documents) and thus dropped.
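A standard way to write this ranking principle is the query-likelihood formulation below. It is offered as a sketch of the usual notation, not necessarily the exact form shown in the talk; the unsmoothed estimate in the last term and the symbols |q| and |d| (query and document length) are assumptions.

```latex
\Pr(d \mid q) \;\propto\; \Pr(q \mid d)\,\Pr(d),
\qquad
\Pr(q \mid d) = \prod_{i=1}^{|q|} \Pr(w_i \mid d),
\qquad
\Pr(w_i \mid d) = \frac{n(w_i, d)}{|d|}
```

The factor Pr(d) is the document prior. When it is the same for every document it does not affect the ranking, which is why it can be dropped.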


What is a Document?

• For a tweet, n(w_i, d) is self-explanatory: the number of times word w_i occurs in tweet d.

• For other entities (hashtags and people), we use the “virtual document” approach (Macdonald, 2009).


Virtual Documents

• For a hashtag h_i, define a virtual document d_i that consists of the concatenated text of all tweets containing h_i.

Documents
  Lorem #ipsum dolor sit amet
  In blandit ipsum purus vitae
  #Ipsum #sapien mollis dui

Virtual Documents
  #ipsum: Lorem #ipsum dolor sit amet / #Ipsum #sapien mollis dui
  #sapien: #Ipsum #sapien mollis dui
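The construction of virtual documents can be sketched in Python as follows. The function name `build_virtual_documents`, the `#\w+` pattern, and the case-folding of tags (suggested by #Ipsum and #ipsum sharing a virtual document in the example) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def build_virtual_documents(tweets):
    """Map each hashtag to the concatenated text of the tweets containing it."""
    by_tag = defaultdict(list)
    for tweet in tweets:
        # A set keeps a tweet from being added twice if it repeats a tag.
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        for tag in tags:
            by_tag[tag].append(tweet)
    return {tag: " ".join(texts) for tag, texts in by_tag.items()}


tweets = [
    "Lorem #ipsum dolor sit amet",
    "In blandit ipsum purus vitae",
    "#Ipsum #sapien mollis dui",
]
virtual_docs = build_virtual_documents(tweets)
```

The second tweet contains the word "ipsum" but no hashtag, so it contributes to neither virtual document.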


Does it work?

Results for SIGIR:
#sigir
#sigir2010
#sigir10
#gingdotblog
#recsys
#msrsocpapers
#kdd2010
#tripdavisor
#kannada
#ecdl2010
#genevaishotandhasnoairconditioning
#sigir20010
#wsdm2011


Hashtag Priors

• Some tags are better than others.
  • Even if its language model is on-topic, a very common tag (e.g. #mutread) is probably not useful.
• But rarity isn’t much help.
  • #genevaishotandhasnoairconditioning
  • Workhorse measures like IDF don’t get at tag usefulness.
• But “document” (i.e. tag) priors offer help.


Hashtag Priors

• Intuition: a tag is likely to be useful if it is used in many useful tweets.

• A tweet is useful if it contains many useful tags (obviously this is an oversimplification).


Hashtag Priors—Analogy to PageRank

Where:
• h is a hashtag.
• H is the set of tags that co-occur with h.
• t is a hashtag in the set H.
• α is a constant so that the probabilities sum to one.
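An equation consistent with these definitions, and with the calculation steps given later in the talk ("Set Pr(h) = sum of Pr(.) for all tags in H", then normalize), would be the following. It is a reconstruction under that assumption, not a formula quoted from the talk.

```latex
\Pr(h) = \alpha \sum_{t \in H} \Pr(t)
```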


Hashtag Priors—Analogy to PageRank

• These prior probabilities are the steady state of the Markov chain…
• A “random reader” model:
  • Reading tweets
  • Choosing at random what to do next:
    o Examine tweets with a hashtag in the current tweet
    o Go to a random, new tweet (so we need…)


Hashtag Priors—Calculation

1. Initialize all n(T) tags to constant probability.
2. For each tag h:
   1. Find the set of tags H that co-occur with h.
   2. Set Pr(h) = sum of Pr(.) for all tags in H.
   3. Normalize all scores.
3. Iterate, repeating step 2 until convergence.
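The three steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the author's implementation: the function names, the `#\w+` pattern, the tolerance, and especially the `jump` constant (a stand-in for the "go to a random, new tweet" move in the random reader model, whose exact form the talk does not give) are assumptions.

```python
import re
from itertools import combinations

# Assumed pattern: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def cooccurrence_graph(tweets):
    """Link two tags whenever they appear together in a tweet."""
    neighbors = {}
    for tweet in tweets:
        tags = sorted({tag.lower() for tag in HASHTAG.findall(tweet)})
        for tag in tags:
            neighbors.setdefault(tag, set())  # keep tags that never co-occur
        for a, b in combinations(tags, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def hashtag_priors(neighbors, jump=0.15, tol=1e-10, max_iter=1000):
    """Iterate the sum-and-normalize update until the scores stop changing."""
    tags = list(neighbors)
    if not tags:
        return {}
    # Step 1: constant initial probability for every tag.
    prior = {h: 1.0 / len(tags) for h in tags}
    # Step 3: repeat step 2 until convergence (or max_iter).
    for _ in range(max_iter):
        # Step 2: sum the current priors of co-occurring tags.
        # `jump` is an assumed random-jump term, not taken from the talk.
        raw = {h: jump + sum(prior[t] for t in neighbors[h]) for h in tags}
        total = sum(raw.values())
        updated = {h: score / total for h, score in raw.items()}
        change = max(abs(updated[h] - prior[h]) for h in tags)
        prior = updated
        if change < tol:
            break
    return prior


# Toy tweets, invented for illustration.
toy_tweets = ["reading #sigir #ir papers", "off to #sigir #recsys"]
priors = hashtag_priors(cooccurrence_graph(toy_tweets))
```

In the toy example #sigir co-occurs with both other tags and ends up with the largest prior. The `jump` term is also a practical choice: without something like it, the pure sum-and-normalize update can oscillate between two states on a bipartite co-occurrence graph such as this one.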


A Return to Intuition

• Assume that if two tags co-occur in a tweet, they share an affinity (i.e. they are linked).

• Assume that tags that occur in many tweets are highly engaged in the discourse on Twitter.

• Highly engaged tags spread their influence to tags that may be less popular but are still linked to engaged ones.


Properties of Hashtag “social” Priors

Cor(freq, prior)=0.275

+-----------------+---------+---------------------+
| tag_text        | docFreq | score               |
+-----------------+---------+---------------------+
| #linkeddata     |     559 | 0.00537524473400945 |
| #opensource     |    1054 | 0.00427572303218856 |
| #semanticweb    |     215 | 0.00406174857132168 |
| #yam            |     345 | 0.00269713986530859 |
| #rdf            |     106 | 0.00257291441134344 |
| #hadoop         |     304 | 0.00247387571898314 |
| #e20            |     512 | 0.00235774256615437 |
| #opendata       |     343 | 0.00234389638939712 |
| #opengov        |     563 | 0.00230838488530255 |
| #nosql          |     414 | 0.00220599138375711 |
| #gov20          |    1964 | 0.00218390304201311 |
| #semweb         |     116 | 0.00209311764669248 |
| #cio            |     199 | 0.00190685555120058 |
| #a11y           |     462 | 0.00184103588077775 |
| #sparql         |      61 | 0.001802252610603   |
| #webid          |     125 | 0.00170313580607837 |
| #semantic       |      89 | 0.001699091839444   |
| #cloudcomputing |     123 | 0.00166678332288627 |
| #rdfa           |     117 | 0.00165629661283845 |
| #oss            |      75 | 0.00164741425796336 |
+-----------------+---------+---------------------+


An example: “immigration reform”

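+------------------+--------------+
| No Priors        | With Priors  |
+------------------+--------------+
| #immigration     | #immigration |
| #aussiemigration | #politics    |
| #teachers        | #twisters    |
| #physicians      | #economist   |
| #parenting       | #tlot        |
| #election        | #tcot        |
| #twisters        | #healthcare  |
| #reform          | #ocra        |
| #politics        | #sgp         |
| #hcreform        | #teaparty    |
+------------------+--------------+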


Assessment

• 25 test queries examined via 2 Amazon Mechanical Turk activities.
  • Queries were created manually.
• For each task, each query was completed by 5 people. Estimates of usefulness were obtained by averaging the 5 scores.

Research question: Does incorporating social priors into hashtag retrieval improve the usefulness of results?


Task 1: assess an individual query/model pair (10 results)

Assess:
1. Overall usefulness
2. Clarity of results
3. Obviousness of results

(additional demographic info collected)


Task 2: compare the quality of two rankings (10 results each)

Assess:
1. Overall usefulness
2. Clarity of results
3. Obviousness of results

(additional demographic info collected)


Task 1: Single Model Assessment

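+------------+---------+---------+-------------+
|            | Overall | Clarity | Too obvious |
+------------+---------+---------+-------------+
| No priors  |   1.808 |   1.883 |       2.512 |
| Priors     |   2.200 |   2.225 |       2.817 |
| % improved |  21.618 |  18.163 |    (12.141) |
| p-value    |   0.008 |   0.030 |       0.206 |
+------------+---------+---------+-------------+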


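Task 2: Comparing Models

+------------------------------------+---------+---------+-------------+
|                                    | Overall | Clarity | Too obvious |
+------------------------------------+---------+---------+-------------+
| Score (2 = priors, -2 = no priors) |     1.6 |    1.28 |       -1.36 |
| p-value                            |   0.094 |   0.134 |       0.134 |
+------------------------------------+---------+---------+-------------+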


What Does Hashtag Retrieval Let Us Do?

• Ad hoc tag retrieval
• Query Expansion (Efron, 2010)
• Document Expansion


Ad hoc Retrieval


Query Expansion: immigration reform

Relevance model

#weight( 0.5 #combine(immigration reform)
         0.5 #weight( 0.9652168 immigration 0.8424631 reform 0.1551001 rt
                      0.1448956 t 0.1413850 obama 0.1361353 law
                      0.1344551 aussiemigration 0.1299880 s
                      0.1008342 australia 0.0939461 illegal ) )

Hashtag expansion

#weight( 0.5 #combine(immigration reform)
         0.5 #weight( 4.48 immigration 1.357 politics 0.965 twisters
                      0.927 economist 0.847 tlot ) )

Efron (2010): 8.2% improvement over baseline, 6.92% over term-based feedback.
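A hashtag expansion query of the shape shown above could be assembled from retrieved tags with a small helper. This is a sketch under assumptions: the function name `hashtag_expansion_query`, its parameters, and the equal 0.5/0.5 mixing default are illustrative, and the code only formats a string in the #weight / #combine syntax used on this slide (which matches the Indri query language); it does not run a search.

```python
def hashtag_expansion_query(query_terms, scored_tags, top_k=5, mix=0.5):
    """Build an expanded query string from the original terms and scored tags.

    query_terms: list of words in the original query.
    scored_tags: list of (tag, score) pairs from hashtag retrieval.
    mix: weight on the original query; 1 - mix goes to the expansion tags.
    """
    original = "#combine({})".format(" ".join(query_terms))
    # Keep the top_k highest-scoring tags.
    top = sorted(scored_tags, key=lambda pair: pair[1], reverse=True)[:top_k]
    # Drop the leading '#' so each tag is matched as an ordinary term,
    # as in the hashtag expansion query on this slide.
    weighted = " ".join(
        "{} {}".format(score, tag.lstrip("#")) for tag, score in top
    )
    return f"#weight( {mix} {original} {1 - mix} #weight( {weighted} ) )"


# Tag weights copied from the hashtag expansion example on this slide.
tags = [
    ("#immigration", 4.48),
    ("#politics", 1.357),
    ("#twisters", 0.965),
    ("#economist", 0.927),
    ("#tlot", 0.847),
]
expanded = hashtag_expansion_query(["immigration", "reform"], tags)
```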


Document Expansion

Browsers For Visually Impaired Users: #a11y #accessibility #assistive #axs #touch

Key Elements of a Startup: #startup #newtech #meetup #meetups #prodmktg #lean


Next Steps

• Articulate and investigate two senses of “search” on Twitter:
  • Searching over collected, indexed tweets.
  • Social search: Curious to hear from anyone who has gotten to play with @blekko. The user-controlled sorting (what they call "slashtags") is intriguing.

• Consider document surrogates for retrieval sets.

• Information synthesis from retrieved data: “spontaneous documents.”


Thank You!


References

Balog, K., Azzopardi, L., & de Rijke, M. (2009). A language modeling framework for expert finding. Information Processing & Management, 45(1), 1-19.

Efron, M. (2010). Hashtag retrieval in a microblogging environment. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (pp. 787-788). Geneva, Switzerland: ACM.

Macdonald, C. (2009). The Voting Model for People Search (Doctoral Dissertation). University of Glasgow.