Twitter as a Personalizable Information Service
Part 2
2015/11/9 John
Outline
● Review Part 1
● Related work
○ Aggregation, Propagation, and Recommendation Through Social Network
○ Event Identification in Social Network
○ Content Personalization
● Implementation
○ Real-time Vectorization of Tweets
○ Analyzing the Stream of Information by Taking into Account Temporal Conditions
○ PageRank
● Summary
● Next
Review: Abstract
Twitter spreads information quickly, making this information system one of the fastest in the world. This chapter introduces a novel topic-detection method that retrieves the most emergent topics expressed by the Twitter network around a user’s domain interests. It proposes an innovative term aging model, based on a biological metaphor, to retrieve the freshest topics of discussion.
Review: What Is Data Mining?
The process of analyzing data from different perspectives and summarizing it into useful information.
“Do you really understand your client or customer?”
Review: Introduction
Simplified notion:
TopicsExtracting emerging topics
Personalizing
Related work
In the last decade, the enormous amount of content generated by Web users has created new challenges and new research questions within the data mining community. The works surveyed here are related to the research of this chapter.
Survey fields:
1. Aggregation, propagation, and recommendation of information from large-scale social networks
2. Automatic detection of events within user-generated environments
3. Content personalization and current user context analysis
Aggregation, Propagation, and Recommendation Through Social Network
● First issue: aggregation of content through filtering and merging. There are two main approaches:
○ Collaborative filtering: selecting and proposing content by looking at what similar users have already selected (i.e., filtering by what users select).
○ Content-based techniques: analyzing the semantics of the content without considering its origin (i.e., using only the text content).
● Second issue: analysis of how information spreads. In general, it is analyzed much like the spread of a disease in a social environment.
● Trendistic and Twopular are two examples of tools that make it possible to analyze the trends of keywords along a timeline specified by the user.
Event Identification in Social Network
● Identifying events in real time on Twitter is a challenging problem, due to the heterogeneity (diversity) and immense scale of the data.
● One study defines a typology of five generic classes of tweets: News, Events, Opinions, Deals, and Private messages.
● Real-time social content can also be seen as a sensor that captures what is happening in the world. As with the recommendation task, this can be exploited for a zero-delay information broadcasting system that detects emerging concepts.
● All the techniques rely on some measure of the importance of the keywords, such as TF-IDF. The weighting should avoid the collapse of important terms when they appear in many text documents.
Content Personalization
● This survey also covers content personalization as relevant work, since the system of this chapter includes a module for personalizing the emerging topics retrieved from Twitter.
● There are several approaches to such a task.
● Depending on the domain, one may be interested in ‘re-ranking’ the results based on their relevance rather than ‘diversifying’ them.
● In brief, the system in this chapter can be classified as a ‘re-rank’ approach.
Implementation
This section illustrates the method for analyzing, in real time, the dynamic stream of information expressed by the Twitter community and retrieving the most emergent topics within the user’s interests.
1. First, the set of tweets generated within a specific time interval is represented as a set of keyword vectors (term vectors).
2. A term aging model monitors the usage of each keyword over time.
3. Moreover, the social reputation of the Twitter users is leveraged to balance the importance of the information expressed by the community.
4. Finally, the user context, provided by the set of tweets the user generated, is taken into account to highlight the most emergent topics within the user’s interests.
Real-time Vectorization of Tweets
● In most Information Retrieval (IR) systems, the first step is the extraction of the relevant keywords (called ‘terms’ in this chapter).
● Consider the t-th time interval, I, for a given time range r.
● The corpus consists of the text of the tweets extracted during that time interval.
● Each component of a tweet vector represents a weighted term extracted from the related tweet. The weight of the x-th vocabulary term in the j-th tweet is computed with the augmented normalized term frequency, $w_{j,x} = 0.5 + 0.5 \cdot tf_{j,x} / \max_{z} tf_{j,z}$, where $tf_{j,x}$ is the number of occurrences of term x in tweet j.
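The vectorization step can be sketched as follows. This is a minimal illustration, assuming a deliberately naive tokenizer; the function names `tokenize` and `augmented_tf_vector` and the sample tweets are hypothetical and not taken from the chapter.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep simple word-like tokens (a naive tokenizer)."""
    return re.findall(r"[a-z0-9#@']+", text.lower())

def augmented_tf_vector(text):
    """Map each term of one tweet to 0.5 + 0.5 * tf / max_tf."""
    counts = Counter(tokenize(text))
    if not counts:
        return {}
    max_tf = max(counts.values())
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in counts.items()}

# Hypothetical tweets collected during one time interval.
tweets = ["storm hits the coast storm warning", "coast guard issues warning"]
vectors = [augmented_tf_vector(t) for t in tweets]
```

Every term present in a tweet gets a weight between 0.5 and 1, so a short tweet is not penalized for its length.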
TF-IDF
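As a point of comparison, here is a minimal sketch of plain TF-IDF, assuming the common variant with relative term frequency and idf = log(N / df). The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized tweets.
docs = [
    ["storm", "hits", "coast"],
    ["storm", "warning", "issued"],
    ["storm", "coast", "guard"],
]
weights = tf_idf(docs)
```

In this toy corpus "storm" appears in every document, so its weight drops to zero. That is the collapse of a term that appears in many documents.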
Analyzing the Stream of Information by Taking into Account Temporal Conditions
● Generally speaking, a term can be viewed as a semantic unit that can potentially link to a new event.
● This section uses a content aging theory to automatically identify coherent discussions through a life-cycle-based content model.
● Many conventional clustering and classification strategies cannot be applied to this problem because they tend to ignore the temporal relationships among documents (tweets).
● The life cycle of a keyword can be considered analogous to that of a living being, which thrives with abundant nourishment (i.e., related tweets).
● However, a keyword, like a life form, dies when nourishment becomes insufficient.
Analyzing the Stream of Information by Taking into Account Temporal Conditions
● Relying on this life analogy, it is possible to evaluate the usage of a keyword by its burstiness.
● Burstiness indicates the vitality status of the keyword and can qualify the keyword’s usage.
● High burstiness implies that the term is becoming important; low burstiness implies that it is not.
● The system uses the concept of authority to define the quality of the nutrition that each tweet gives to every keyword it contains.
● Different tweets containing the same keyword generate different amounts of nutrition (i.e., calories) in the community.
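One plausible way to turn this idea into a number is sketched below. It assumes that each tweet feeds a keyword with its term weight multiplied by the authority of the tweet’s author, summed over all tweets in the interval. This form, the function name `nutrition`, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

def nutrition(tweets, authority):
    """tweets: list of (author, {term: weight}); authority: {author: score}.

    Returns {term: nutrition}. Each tweet contributes
    weight * authority(author) to every term it contains (assumed form).
    """
    nutr = defaultdict(float)
    for author, vector in tweets:
        auth = authority.get(author, 0.0)  # unknown authors contribute nothing
        for term, weight in vector.items():
            nutr[term] += weight * auth
    return dict(nutr)

# Hypothetical authority scores and term vectors.
authority = {"alice": 0.6, "bob": 0.1}
tweets = [
    ("alice", {"storm": 1.0, "coast": 0.75}),
    ("bob", {"storm": 1.0}),
]
nutr = nutrition(tweets, authority)
```

Both tweets mention "storm" with the same weight, but the tweet from the higher-authority author contributes six times as much nutrition.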
Analyzing the Stream of Information by Taking into Account Temporal Conditions
Reputation of the users
● On Twitter, the social model makes it possible to define an author-based graph.
● Reputation can be refined by taking into account that the importance of a user is also related to the importance of his or her followers.
● The well-known PageRank algorithm can be used to calculate this reputation.
Note: PageRank is explained later with a simple example.
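A minimal sketch of a PageRank-style reputation on a follower graph, computed by repeated updates until the scores settle. The function name `user_authority`, the normalization by the number of users, the handling of users who follow nobody, and the sample graph are assumptions, not details confirmed by the chapter.

```python
def user_authority(following, d=0.85, iterations=50):
    """following: {user: set of users they follow}.

    Iterates auth(u) = (1 - d) / N + d * sum(auth(f) / out_degree(f))
    over every follower f of u. Users who follow nobody spread their
    score evenly over all users so that the scores keep summing to 1.
    """
    users = set(following) | {v for vs in following.values() for v in vs}
    n = len(users)
    auth = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        dangling = sum(auth[u] for u in users if not following.get(u))
        new = {u: (1 - d) / n + d * dangling / n for u in users}
        for follower, followees in following.items():
            for followee in followees:
                new[followee] += d * auth[follower] / len(followees)
        auth = new
    return auth

# Hypothetical follower graph: alice and bob follow carol, carol follows alice.
following = {
    "alice": {"carol"},
    "bob": {"carol"},
    "carol": {"alice"},
}
auth = user_authority(following)
```

Here carol ends up with the highest score because she has two followers, and alice ranks above bob because her single follower is the high-authority carol.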
Analyzing the Stream of Information by Taking into Account Temporal Conditions
Computing term Burstiness values
● Once the nutrition of a term is calculated, the aim is to map it to a burstiness value.
● The burstiness value indicates a term’s actual contribution (i.e., how emergent it is) in the corpus of tweets.
● A keyword is defined as emergent if it turns out to be hot in the considered time interval.
● We analyze the keyword life cycles by comparing the nutrition values obtained in the considered time frame with the usage of the same terms in past time intervals. Namely, the current nourishment is analyzed in comparison to the values obtained in the previous time intervals.
Analyzing the Stream of Information by Taking into Account Temporal Conditions
Computing term Burstiness values
● If a keyword’s nutrition value stays constant over close time intervals, the community is probably still referring to the same news event.
● Even if the keyword can be considered hot, it cannot be called emergent, because of this temporal discrimination. (The temporal parameter influences the emerging keywords retrieved by the system.)
● A parameter s limits the number of previous time slots that the system considers when studying the keywords’ life cycles, and defines the history worthiness of the resulting emerging keywords.
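The comparison with past intervals can be sketched as follows. The scoring rule used here, squared nutrition differences weighted by the inverse of the distance in time slots, is an illustrative assumption and may differ from the chapter’s exact formula. The function name `burstiness` and the sample histories are hypothetical.

```python
def burstiness(history, s):
    """history: nutrition values of one term, oldest first; the last is current.

    Compares the current value with at most `s` previous intervals and
    weights each squared difference by 1 / (distance in time slots), so
    that recent intervals count more (an assumed scoring rule).
    """
    if s <= 0:
        raise ValueError("s must be positive")
    current = history[-1]
    previous = history[-(s + 1):-1]  # up to s intervals before the current one
    score = 0.0
    for distance, past in enumerate(reversed(previous), start=1):
        score += (current ** 2 - past ** 2) / distance
    return score

steady = burstiness([5, 5, 5, 5], s=3)   # hot, but constant nutrition
rising = burstiness([1, 1, 1, 5], s=3)   # sudden growth
fading = burstiness([5, 5, 5, 1], s=3)   # losing nourishment
```

A keyword with high but constant nutrition scores zero, which matches the point that a hot keyword is not necessarily emergent. A smaller s makes the system forget older intervals sooner.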
PageRank
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
Random surfer model, where the damping factor d is usually 0.85
Note: It can be seen as a Markov chain.
Keywords: PageRank, Google search, Markov, WebCrawler, SEO
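For reference, the random surfer model is usually written as follows in its normalized form. This is a standard statement of PageRank, not a formula recovered from the slide. Here N is the number of pages, M(p_i) is the set of pages linking to p_i, L(p_j) is the number of outgoing links of p_j, and d is the damping factor:

```latex
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```

With d = 0.85, the surfer follows a link 85% of the time and jumps to a random page otherwise.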
PageRank
Conceptual example (random traversal method)
Suppose these are all the pages in the world. Compute PR(A), PR(B), PR(C), PR(D):
1. Traverse pages until 1,000 visits are reached.
2. Traverse pages until 1 million visits are reached.
[Figure: graph with nodes A, B, C, D]
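The random traversal can be simulated directly. The sketch below assumes a hypothetical link structure, since the edges of the slide’s figure are not recoverable, so its numbers will not match the slide’s. The function name `random_surfer` is also an assumption.

```python
import random
from collections import Counter

def random_surfer(links, steps, d=0.85, seed=0):
    """Estimate PageRank as the share of visits each page receives.

    links: {page: list of pages it links to}. With probability d the surfer
    follows a random outgoing link; otherwise (or on a page with no links)
    the surfer jumps to a page chosen uniformly at random.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pages = sorted(links)
    visits = Counter()
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])
        else:
            page = rng.choice(pages)
    return {p: visits[p] / steps for p in pages}

# Hypothetical link structure (not the one from the slide's figure).
links = {
    "A": ["B", "D"],
    "B": ["D"],
    "C": ["A"],
    "D": ["A"],
}
estimate_short = random_surfer(links, 1_000)
# A longer run; the slide uses 1 million, fewer steps keep this sketch fast.
estimate_long = random_surfer(links, 200_000)
```

The short run gives a noisy estimate, while the longer run settles close to the stationary visit shares. Page C, which no other page links to, is reached only through random jumps and therefore gets the smallest share.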
PageRank
Simple example:
Traverse 1000 times
PR(A): 118 times
PR(B): 109 times
PR(C): 13 times
PR(D): 144 times
[Figure: graph with nodes A, B, C, D]
PageRank
Simple example:
Traverse 1 million times
PR(A): 13%
PR(B): 10%
PR(C): 2%
PR(D): 15%
[Figure: graph with nodes A, B, C, D; other percentage labels in the figure: 7%, 4%, 4%, 4%, 14%, 10%, 2%, 2%, 2%, 2%, 5%, 4%]
Summary
● Currently, a lot of research addresses aggregation, propagation, and recommendation of information from large-scale social networks.
● Automatic event detection and personalization based on user context are becoming interesting and important.
● In the implementation, the first step is extracting the keywords and their weights from the tweets in the interval.
● The next step is computing the users’ reputation and the terms’ burstiness values.
● The burstiness value of an emerging keyword is influenced by the temporal parameter (the current interval and its past intervals).
● Computing reputation works like PageRank: both operate on a directed graph.
Next
Continue
Implementation:
● Selection of Emerging Terms
● Leveraging User’s Context for Personalization Purposes
● From Emerging Terms to Emerging Topics
● Topic Detection, Labeling, and Ranking
Experiments
● Case and User Studies
References
● Wiki of TF-IDF - https://en.wikipedia.org/wiki/Tf%E2%80%93idf
● Wiki of Markov Chain - https://en.wikipedia.org/wiki/Markov_chain
● Wiki of PageRank - https://en.wikipedia.org/wiki/PageRank
● PageRank: how it works - http://goo.gl/bbShFd
● Nine Algorithms That Changed the Future - http://goo.gl/Y9BFmO