1
Hashtag Retrieval in a Microblogging Environment
Miles Efron
Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign
[email protected]
http://people.lis.illinois.edu/~mefron
2
Microblog overview
Twitter is by far the largest microblog platform (>50M tweets / day)
• Brief posts
• Temporally ordered
• Follower / followed social model

Why would we search tweets?
• Find answered questions
• Gauge consensus or opinion
• Find ‘elite’ web resources
Twitter serves ~19B searches per month vs. 4B for Bing
http://searchengineland.com/twitter-does-19-billion-searches-per-month-39988
3
Anatomy of a Tweet
[Annotated tweet image, labeling: screen name, hashtag, mention, time stamp]
4
#hashtags
• Author-embedded metadata
• Hashtags collocate tweets:
  • topically
  • contextually (e.g. #local, #toRead)
  • with respect to affect (e.g. #iHateItWhen, #fail)
• This research primarily concerns retrieving topical hashtags:
  • Find tags to ‘follow’
  • Find a canonical tag for an entity (e.g. a conference)
5
Digression: Data Set
Item                         Count
Tweets                       2,308,297
Date range                   07/18/2010 – 08/28/2010
People                       9,297
Unique hashtags              77,185
Tweets containing >= 1 tag   362,087 (~16%)
Tweets containing > 1 tag    89,978 (~3.9%)
6
Hypotheses
General Hypothesis: Metadata in tweets can be marshaled to improve retrieval effectiveness.
Specific Hypothesis: Traditional measures of term importance (such as IDF) don’t translate to the problem of identifying useful hashtags. An alternative ‘social’ measure is more appropriate.
7
8
9
Microblog Entity Search
[Diagram: the query is scored against candidate entities entity1, entity2, …, entityn; cf. Balog et al. 2009.]
10
11
Language Modeling IR
Rank documents in decreasing order of their similarity to the query, where similarity is understood in a specific probabilistic context.

Assume that a document d was generated by some probability distribution M over the words in the indexing vocabulary.

What is the likelihood that the model that generated the text in d also generated the text in our query q? If the likelihood given the language model for document di is greater than the likelihood for another document dj, then we rank di higher than dj.
12
Language Modeling IR

d1: this year’s sigir was better than last year’s sigir
d2: was this (last year’s) sigir better than last

Document 1’s model:
Pr(better|d1)  Pr(last|d1)  Pr(sigir|d1)  Pr(than|d1)  Pr(this|d1)  Pr(was|d1)  Pr(year’s|d1)
1/9            1/9          2/9           1/9          1/9          1/9         2/9

Document 2’s model:
Pr(better|d2)  Pr(last|d2)  Pr(sigir|d2)  Pr(than|d2)  Pr(this|d2)  Pr(was|d2)  Pr(year’s|d2)
1/8            2/8          1/8           1/8          1/8          1/8         1/8
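The two tables above are just maximum-likelihood unigram estimates, Pr(w|d) = n(w,d) / |d|. A minimal sketch in Python (the whitespace tokenization is illustrative):

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model: Pr(w|d) = n(w, d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

d1 = "this year's sigir was better than last year's sigir".split()
d2 = "was this last year's sigir better than last".split()

m1 = unigram_model(d1)  # e.g. Pr(sigir|d1) = 2/9, Pr(this|d1) = 1/9
m2 = unigram_model(d2)  # e.g. Pr(last|d2) = 2/8, every other term 1/8
```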
13
Language Modeling IR
q: this sigir

d1: this year’s sigir was better than last year’s sigir
d2: was this (last year’s) sigir better than last

Document 1’s model:
Pr(better|d1)  Pr(last|d1)  Pr(sigir|d1)  Pr(than|d1)  Pr(this|d1)  Pr(was|d1)  Pr(year’s|d1)
1/9            1/9          2/9           1/9          1/9          1/9         2/9

Document 2’s model:
Pr(better|d2)  Pr(last|d2)  Pr(sigir|d2)  Pr(than|d2)  Pr(this|d2)  Pr(was|d2)  Pr(year’s|d2)
1/8            2/8          1/8           1/8          1/8          1/8         1/8
16
Language Modeling IR
Rank documents by the likelihood that their language models generated the query.
17
Language Modeling IR
Rank documents by the likelihood that their language models generated the query.
18
Language Modeling IR
Rank documents by the likelihood that their language models generated the query.
The document prior Pr(d) is often assumed to be uniform (the same for all documents) and thus dropped.
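The ranking equation on these slides did not survive the transcript. Reconstructed from the surrounding description (a unigram model M_d per document, plus a document prior that is often assumed uniform and dropped), the standard query-likelihood rule is:

```latex
P(d \mid q) \;\propto\; P(q \mid d)\,P(d) \;=\; P(d)\,\prod_{w \in q} P(w \mid M_d)
```

With a uniform prior, documents are ranked by the product of query-term probabilities alone.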
19
Language Modeling IR
q: this sigir

Document 1’s model:
Pr(better|d1)  Pr(last|d1)  Pr(sigir|d1)  Pr(than|d1)  Pr(this|d1)  Pr(was|d1)  Pr(year’s|d1)
1/9            1/9          2/9           1/9          1/9          1/9         2/9

Document 2’s model:
Pr(better|d2)  Pr(last|d2)  Pr(sigir|d2)  Pr(than|d2)  Pr(this|d2)  Pr(was|d2)  Pr(year’s|d2)
1/8            2/8          1/8           1/8          1/8          1/8         1/8
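Scoring q = “this sigir” against the two models multiplies the per-term probabilities (a sketch; real systems smooth these estimates so unseen terms don’t zero the score):

```python
def query_likelihood(query_tokens, model):
    """Pr(q|d) under a unigram model; an unseen term zeroes the score."""
    score = 1.0
    for w in query_tokens:
        score *= model.get(w, 0.0)
    return score

# Per-term probabilities taken from the tables above.
m1 = {"this": 1/9, "sigir": 2/9}
m2 = {"this": 1/8, "sigir": 1/8}
q = "this sigir".split()

s1 = query_likelihood(q, m1)  # 1/9 * 2/9 = 2/81
s2 = query_likelihood(q, m2)  # 1/8 * 1/8 = 1/64
# s1 > s2, so d1 is ranked above d2.
```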
20
What is a Document?
• For a tweet, n(wi, d), the count of word wi in document d, is self-explanatory.
• For other entities (hashtags and people), we use the “virtual document” approach (Macdonald, 2009).
21
Virtual Documents
• For a hashtag hi, define a virtual document di that consists of the concatenated text of all tweets containing hi.

Documents:
  Lorem #ipsum dolor sit amet
  In blandit ipsum purus vitae
  #Ipsum #sapien mollis dui

Virtual Documents:
  #ipsum: Lorem #ipsum dolor sit amet + #Ipsum #sapien mollis dui
  #sapien: #Ipsum #sapien mollis dui
22
Virtual Documents
• For a hashtag hi, define a virtual document di that consists of the concatenated text of all tweets containing hi.

  #ipsum: Lorem #ipsum dolor sit amet + #Ipsum #sapien mollis dui
  #sapien: #Ipsum #sapien mollis dui
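Building virtual documents is a single pass over the collection. A sketch in Python (the `#\w+` hashtag pattern and case-folding are simplifying assumptions):

```python
import re

def virtual_documents(tweets):
    """Map each hashtag (lowercased) to the concatenated text of
    every tweet that contains it."""
    vdocs = {}
    for tweet in tweets:
        for tag in set(t.lower() for t in re.findall(r"#\w+", tweet)):
            vdocs.setdefault(tag, []).append(tweet)
    return {tag: " ".join(texts) for tag, texts in vdocs.items()}

tweets = [
    "Lorem #ipsum dolor sit amet",
    "In blandit ipsum purus vitae",   # untagged: joins no virtual document
    "#Ipsum #sapien mollis dui",
]
vdocs = virtual_documents(tweets)
# vdocs["#ipsum"] concatenates tweets 1 and 3; vdocs["#sapien"] is tweet 3.
```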
23
Does it work? Results for SIGIR:
#sigir #sigir2010 #sigir10
24
Does it work? Results for SIGIR:
#sigir #sigir2010 #sigir10 #gingdotblog #recsys #msrsocpapers #kdd2010
25
Does it work? Results for SIGIR:
#sigir #sigir2010 #sigir10 #gingdotblog #recsys #msrsocpapers #kdd2010 #tripdavisor #kannada #ecdl2010 #genevaishotandhasnoairconditioning #sigir20010 #wsdm2011
26
Hashtag Priors
• Some tags are better than others.
• Even if its language model is on-topic, a very common tag (e.g. #mutread) is probably not useful.
• But rarity isn’t much help either: #genevaishotandhasnoairconditioning
• Workhorse measures like IDF don’t get at tag usefulness.
• But “document” (i.e. tag) priors offer help.
27
Hashtag Priors
• Intuition: a tag is likely to be useful if it is used in many useful tweets.
• A tweet is useful if it contains many useful tags (obviously this is an oversimplification).
28
Hashtag Priors—Analogy to PageRank
Where:
• h is a hashtag.
• H is the set of tags that co-occur with h.
• t is a hashtag in the set H.
• α is a constant so that the probabilities sum to one.
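The equation itself is missing from the transcript; from the legend above and the iterative calculation on slide 31, a plausible reconstruction is:

```latex
\Pr(h) \;=\; \alpha \sum_{t \in H} \Pr(t)
```

That is, a tag’s prior is the (normalized) sum of the priors of the tags it co-occurs with.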
29
Hashtag Priors—Analogy to PageRank
• These prior probabilities are the steady state of the Markov chain…
• A “random reader” model:
  • Reading tweets
  • Choosing at random what to do next:
    • Examine tweets with a hashtag in the current tweet
    • Go to a random, new tweet (so we need…)
30
31
Hashtag Priors—Calculation
1. Initialize all n(T) tags to a constant probability.
2. For each tag h:
   a. Find the set of tags H that co-occur with h.
   b. Set Pr(h) = the sum of Pr(·) over all tags in H.
   c. Normalize all scores.
3. Iterate, repeating step 2 until convergence.
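The steps above can be sketched in Python. The damping term mirrors the “random reader” jump to a random new tweet; the 0.85 value, the hashtag regex, and the co-occurrence extraction are illustrative assumptions, not details from the paper:

```python
import re
from collections import defaultdict
from itertools import combinations

def hashtag_priors(tweets, damping=0.85, iterations=50):
    """PageRank-style 'social' prior: each tag's score is the
    (damped, normalized) sum of the scores of its co-occurring tags."""
    cooccur = defaultdict(set)
    tags = set()
    for tweet in tweets:
        found = {t.lower() for t in re.findall(r"#\w+", tweet)}
        tags |= found
        for a, b in combinations(found, 2):
            cooccur[a].add(b)
            cooccur[b].add(a)
    n = len(tags)
    prior = {t: 1.0 / n for t in tags}           # step 1: constant init
    for _ in range(iterations):
        # step 2: Pr(h) = sum of Pr(.) over co-occurring tags, normalized
        raw = {h: sum(prior[t] for t in cooccur[h]) for h in tags}
        z = sum(raw.values()) or 1.0
        prior = {h: (1 - damping) / n + damping * raw[h] / z for h in tags}
    return prior
```

A tag that never co-occurs with others keeps only the uniform jump mass, while well-connected tags accumulate score from their neighbors.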
32
A Return to Intuition
• Assume that if two tags co-occur in a tweet, they share an affinity (i.e. they are linked).
• Assume that tags that occur in many tweets are highly engaged in the discourse on Twitter.
• Highly engaged tags spread their influence to tags that may be less popular but are linked to engaged tags.
33
Properties of Hashtag “social” Priors
Cor(freq, prior) = 0.275
34
Properties of Hashtag “social” Priors
+-----------------+---------+---------------------+
| tag_text        | docFreq | score               |
+-----------------+---------+---------------------+
| #linkeddata     |     559 | 0.00537524473400945 |
| #opensource     |    1054 | 0.00427572303218856 |
| #semanticweb    |     215 | 0.00406174857132168 |
| #yam            |     345 | 0.00269713986530859 |
| #rdf            |     106 | 0.00257291441134344 |
| #hadoop         |     304 | 0.00247387571898314 |
| #e20            |     512 | 0.00235774256615437 |
| #opendata       |     343 | 0.00234389638939712 |
| #opengov        |     563 | 0.00230838488530255 |
| #nosql          |     414 | 0.00220599138375711 |
| #gov20          |    1964 | 0.00218390304201311 |
| #semweb         |     116 | 0.00209311764669248 |
| #cio            |     199 | 0.00190685555120058 |
| #a11y           |     462 | 0.00184103588077775 |
| #sparql         |      61 | 0.001802252610603   |
| #webid          |     125 | 0.00170313580607837 |
| #semantic       |      89 | 0.001699091839444   |
| #cloudcomputing |     123 | 0.00166678332288627 |
| #rdfa           |     117 | 0.00165629661283845 |
| #oss            |      75 | 0.00164741425796336 |
+-----------------+---------+---------------------+
35
An example: “immigration reform”
No Priors: #immigration #aussiemigration #teachers #physicians #parenting #election #twisters #reform #politics #hcreform

With Priors: #immigration #politics #twisters #economist #tlot #tcot #healthcare #ocra #sgp #teaparty
36
Assessment
• 25 test queries, examined via 2 Amazon Mechanical Turk tasks.
• Queries were created manually.
• For each task, each query was completed by 5 people; usefulness estimates were obtained by averaging the 5 scores.
Research question: Does incorporating social priors into hashtag retrieval improve the usefulness of results?
37
Task 1: assess an individual query/model pair (10 results)
Assess:
1. Overall usefulness
2. Clarity of results
3. Obviousness of results
(Additional demographic info collected.)
38
Task 2: compare the quality of two rankings (10 results each)
Assess:
1. Overall usefulness
2. Clarity of results
3. Obviousness of results
(Additional demographic info collected.)
39
Task 1: Single Model Assessment
             Overall   Clarity   Too obvious
No priors    1.808     1.883     2.512
Priors       2.200     2.225     2.817
% improved   21.618    18.163    (12.141)
p-value      0.008     0.030     0.206
40
Task 2: Comparing Models

             Overall   Clarity   Too obvious
Score        1.6       1.28      -1.36
p-value      0.094     0.134     0.134

(Scale: 2 = priors preferred, -2 = no priors preferred)
41
What Does Hashtag Retrieval Let Us Do?
• Ad hoc tag retrieval
• Query Expansion (Efron, 2010)
• Document Expansion
42
Ad hoc Retrieval
43
Query Expansion: immigration reform
Relevance model:
#weight( 0.5 #combine( immigration reform )
         0.5 #weight( 0.9652168 immigration 0.8424631 reform 0.1551001 rt
                      0.1448956 t 0.1413850 obama 0.1361353 law
                      0.1344551 aussiemigration 0.1299880 s
                      0.1008342 australia 0.0939461 illegal ) )

Hashtag expansion:
#weight( 0.5 #combine( immigration reform )
         0.5 #weight( 4.48 immigration 1.357 politics 0.965 twisters
                      0.927 economist 0.847 tlot ) )
Efron (2010): 8.2% improvement over baseline, 6.92% over term-based feedback.
44
Document Expansion
• Browsers For Visually Impaired Users
• Key Elements of a Startup
45
Document Expansion
• Browsers For Visually Impaired Users: #a11y #accessibility #assistive #axs #touch
• Key Elements of a Startup: #startup #newtech #meetup #meetups #prodmktg #lean
46
Next Steps
• Articulate and investigate two senses of “search” on Twitter:
  • Searching over collected, indexed tweets.
  • Social search, e.g.: “Curious to hear from anyone who has gotten to play with @blekko. The user-controlled sorting (what they call ‘slashtags’) is intriguing.”
• Consider document surrogates for retrieval sets.
• Information synthesis from retrieved data: “spontaneous documents.”
47
Thank You!
48
References
Balog, K., Azzopardi, L., & de Rijke, M. (2009). A language modeling framework for expert finding. Information Processing & Management, 45(1), 1–19.
Efron, M. (2010). Hashtag retrieval in a microblogging environment. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (pp. 787-788). Geneva, Switzerland: ACM.
Macdonald, C. (2009). The Voting Model for People Search (Doctoral Dissertation). University of Glasgow.