twitter as a personalizable information service ii

Twitter as a Personalizable Information Service

Part 2
2015/11/9 John

Outline
● Review Part 1
● Related work
○ Aggregation, Propagation, and Recommendation Through Social Network
○ Event Identification in Social Network
○ Content Personalization
● Implementation
○ Real-time Vectorization of Tweets
○ Analyzing the Stream of Information by Taking into Account Temporal Conditions
○ PageRank
● Summary
● Next

Review: Abstract
... Twitter, making this information system one of the fastest in the world. This chapter introduces a novel topic-detection method that retrieves the most emerging topics expressed by the Twitter network around the user's domain interests. It proposes an innovative term aging model, based on a biological metaphor, to retrieve the freshest topics of discussion.

Review: What’s Data Mining
The process of analyzing data from different perspectives and summarizing it into useful information.

“You really understand your client or customer?”

Review: Introduction
Simplified notion:

Topics

Extracting emerging topics

Personalizing

Related work
In the last decade, the enormous amount of content generated by Web users has created new challenges and new research questions within the data mining community. The works surveyed here are the ones related to the research of this chapter.

Survey fields:

1. Aggregation, propagation, and recommendation of information from large-scale social networks
2. Automatic detection of events within user-generated environments
3. Content personalization and analysis of the current user context

Aggregation, Propagation, and Recommendation Through Social Network

● First issue: Aggregation of content through filtering and merging. There are two main approaches:
○ Collaborative filtering: Selecting and proposing content by looking at what similar users have already selected (i.e., filtering by what users select).
○ Content-based techniques: Analyzing the semantics of the content without considering its origin (i.e., using only the text content).

● Second issue: Analysis of how information spreads. In general, the spread is analyzed like the spread of a disease in a social environment.

● Trendistic and Twopular are two examples of tools that make it possible to analyze the trends of keywords along a timeline specified by the user.

Event Identification in Social Network
● Identifying events in real time on Twitter is a challenging problem, due to the heterogeneity (diversity) and immense scale of the data.
● One study defines a typology of five generic classes of tweets: News, Events, Opinions, Deals, and Private messages.
● Real-time social content can also be seen as a sensor that captures what is happening in the world: similarly to the recommendation task, this can be exploited for a zero-delay information broadcasting system that detects emerging concepts.

● All the techniques rely on some measure of keyword importance, such as TF-IDF, which is used to avoid the collapse of important terms when they appear in many text documents.
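As a rough illustration of keyword importance, the sketch below computes a plain TF-IDF weight (term frequency times the log of the inverse document frequency). The function name `tf_idf` and the sample tweets are made up for this example; the surveyed systems may use other variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample data, already tokenized.
tweets = [
    "earthquake hits the city".split(),
    "the city marathon starts".split(),
    "the weather is nice".split(),
]
scores = tf_idf(tweets)
```

In this toy corpus "the" appears in every tweet and gets weight 0, while "earthquake", which appears in only one tweet, scores higher than "city", which appears in two.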

Content Personalization
● Content personalization is also surveyed as relevant work, since the system of this chapter includes a module for personalizing the emerging topics retrieved from Twitter.

● Several approaches exist for facing such a task.
● Depending on the domain, one may be interested in ‘re-ranking’ the results based on their relevance rather than ‘diversifying’ them.
● In brief, the system in this chapter can be classified as a ‘re-rank’ approach.

Implementation
This section illustrates the method for analyzing, in real time, the dynamic stream of information expressed by the Twitter community and retrieving the most emerging topics within the user’s interests.

1. First, the set of tweets generated within a specific time interval is represented as a set of keyword vectors (term vectors).
2. A term aging model monitors the usage of each keyword over time.
3. Moreover, the social reputation of the Twitter users is leveraged to balance the importance of the information expressed by the community.
4. Finally, the user context, provided by the set of tweets the user has generated, is taken into account to highlight the most emerging topics within the user’s interests.

Real-time Vectorization of Tweets
● In most Information Retrieval (IR) systems, the first step is the extraction of the relevant keywords (called ‘terms’ in this chapter).
● Consider the t-th interval, I, for a given time range r:

● The corpus is built from the text of the tweets extracted during that time interval.

● Each tweet is represented as a vector whose components are the weighted terms extracted from that tweet. The weight of the x-th vocabulary term in the j-th tweet is computed using the augmented normalized term frequency:

TF-IDF
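The augmented normalized term frequency is commonly defined as w = 0.5 + 0.5 · tf / max_tf, where tf is the count of the term in the tweet and max_tf is the highest count of any term in the same tweet. The sketch below assumes this standard definition; the helper names `augmented_tf` and `vectorize` and the sample tweets are illustrative only.

```python
from collections import Counter

def augmented_tf(tokens):
    """Weight each term as 0.5 + 0.5 * tf / max_tf within one tweet."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_tf = max(counts.values())
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in counts.items()}

def vectorize(tweets, vocabulary):
    """Turn tokenized tweets into dense vectors over a fixed vocabulary."""
    vectors = []
    for tokens in tweets:
        weights = augmented_tf(tokens)
        # Terms absent from the tweet get weight 0.
        vectors.append([weights.get(term, 0.0) for term in vocabulary])
    return vectors

# Hypothetical sample data, already tokenized.
tweets = [["goal", "goal", "match"], ["match", "rain"]]
vocabulary = sorted({term for tokens in tweets for term in tokens})
vectors = vectorize(tweets, vocabulary)
```

With the vocabulary ordered as goal, match, rain, the first tweet becomes [1.0, 0.75, 0.0]: every term that occurs gets at least 0.5, so short tweets are not dominated by a single repeated word.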

Analyzing the Stream of Information by Taking into Account Temporal Conditions

● Generally speaking, a term can be viewed as a semantic unit which can potentially link to a new event.

● This section uses a content aging theory to automatically identify coherent discussions through a life cycle-based content model.

● Many conventional clustering and classification strategies cannot be applied to this problem because they tend to ignore the temporal relationships among documents (tweets).

● The life cycle of a keyword can be considered analogous to that of a living being: with abundant nourishment (i.e., related tweets), its life cycle is prolonged.

● However, a keyword, like a life form, dies when nourishment (food) becomes insufficient.


● Relying on this life analogy, it is possible to evaluate the usage of a keyword by its burstiness.
● Burstiness indicates the vitality status of the keyword and can qualify the keyword’s usage.
● High or low burstiness indicates whether or not the term is becoming important.
● Therefore, the system uses the concept of authority to define the quality of the nutrition that each tweet gives to every keyword it contains.
● Different tweets containing the same keyword generate different amounts of nutrition (i.e., calories) in the community:
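One way to read the nutrition idea is as a sum, over the tweets of the interval, of the term's weight in each tweet multiplied by the authority of the tweet's author. The exact formula is not reproduced in these slides, so the sketch below is an assumption; the function name `nutrition`, the dictionary layout, and the sample values are hypothetical.

```python
def nutrition(term, tweets, authority):
    """Sum weight * author authority over the tweets containing the term."""
    total = 0.0
    for tweet in tweets:
        # Tweets that do not contain the term contribute nothing.
        weight = tweet["weights"].get(term, 0.0)
        total += weight * authority.get(tweet["author"], 0.0)
    return total

# Hypothetical data: term weights per tweet and authority per author.
tweets = [
    {"author": "alice", "weights": {"storm": 1.0, "city": 0.75}},
    {"author": "bob", "weights": {"storm": 0.5}},
]
authority = {"alice": 0.5, "bob": 0.25}
storm = nutrition("storm", tweets, authority)
city = nutrition("city", tweets, authority)
```

Under this reading, the same keyword receives more nutrition from a tweet written by a high-authority user than from one written by a low-authority user.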


Reputation of the users

● In Twitter, the social model makes it possible to define an author-based graph.
● Reputation can be extended by taking into account the fact that the importance of a user is also related to the degree of importance of its followers.

● We can refer to the well-known PageRank algorithm for this task, which calculates the reputation as follows:

Note: PageRank is explained later with a simple example.
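By analogy with the standard PageRank formula, the reputation of a user u_i might be written as below. This is a sketch under assumptions: the names auth, follower, and following are introduced here for illustration, and d is the damping factor.

```latex
\mathrm{auth}(u_i) = d \sum_{u_j \in \mathrm{follower}(u_i)} \frac{\mathrm{auth}(u_j)}{|\mathrm{following}(u_j)|} + (1 - d)
```

In this form, each follower u_j passes on a share of its own reputation, divided by the number of users it follows.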


Computing term Burstiness values

● Once the nutrition of a term is calculated, the aim is to map it into a burstiness value.

● The burstiness value indicates a term’s actual contribution (i.e., how emergent it is) in the corpus of tweets.

● A keyword is defined as emergent if it turns out to be hot in the considered time interval.

● We analyze the keyword life cycles by comparing their nutrition values obtained in the considered time frame with the usage of the same terms in past time intervals. Namely, the current nourishment is analyzed in comparison to the ones built up in the previous time intervals.


Computing term Burstiness values

● If a keyword’s nutrition value stays constant over close time intervals, the community is probably still referring to the same news event.

● Even if the keyword can be considered hot, it cannot be considered emergent because of the temporal discrimination. (The temporal parameter influences the emerging keywords retrieved by the system.)

● A parameter s, where , limits the number of previous time slots considered by the system to study the keywords’ life cycles and defines the history worthiness of the resulting emerging keywords.
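The comparison with past intervals can be sketched as follows. The exact burstiness formula is not given in these slides, so this is one plausible form, assumed for illustration: the difference between the squared current nutrition and each past squared nutrition, weighted by the inverse of the temporal distance. The function name `burstiness` and the sample histories are hypothetical.

```python
def burstiness(history, s):
    """Compare the latest nutrition with the s previous time slots.

    history: nutrition values of one term, oldest first; the last
    entry belongs to the current interval t.
    """
    current = history[-1]
    past = history[-(s + 1):-1]  # at most s slots before t
    energy = 0.0
    for distance, value in enumerate(reversed(past), start=1):
        # Nearer slots (small distance) weigh more than distant ones.
        energy += (current ** 2 - value ** 2) / distance
    return energy

rising = burstiness([1.0, 1.0, 4.0], s=2)   # suddenly popular term
steady = burstiness([2.0, 2.0, 2.0], s=2)   # hot but not emergent
fading = burstiness([4.0, 4.0, 1.0], s=2)   # term losing nourishment
```

A term with constant nutrition scores 0 here even when its nutrition is high, which matches the point that a hot keyword is not necessarily emergent. A larger s makes the system look further back before calling a keyword new.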

PageRank
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

Random surfer model, where the damping factor d = 0.85 (usually)

Note: It can be seen as a Markov chain.

Keywords: PageRank, Google search, Markov, WebCrawler, SEO
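A common way to compute PageRank is power iteration on PR(p) = (1 - d)/N + d · Σ PR(q)/L(q), where the sum runs over the pages q linking to p, L(q) is the number of outgoing links of q, and N is the number of pages. The sketch below uses this normalized form; the four-page graph is a hypothetical example and not the graph from the slides.

```python
def pagerank(links, d=0.85, iterations=100):
    """Power iteration over a {page: [outgoing links]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts with the random-jump share (1 - d) / N.
        new_rank = {page: (1.0 - d) / n for page in pages}
        for page, outlinks in links.items():
            # A page without outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = d * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this graph nobody links to D, so D keeps only the random-jump share (1 - d)/N = 0.0375, while C, which receives links from three pages, ends up with the highest rank.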

PageRank
Example in notion (random traversal method)

Suppose these are all the pages in the world. Compute PR(A), PR(B), PR(C), PR(D).

1. Traverse the pages until 1,000 visits have been made.

2. Traverse the pages until 1 million visits have been made.

[Figure: link graph of pages A, B, C, and D]
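The random traversal idea can be simulated directly: a surfer follows a random outgoing link with probability d and otherwise jumps to a random page, and the share of visits per page approximates its PageRank. This is a minimal sketch; the function name `random_surfer`, the fixed seed, and the four-page graph are assumptions for illustration, since the edges of the graph on the slide are not recoverable from the text.

```python
import random

def random_surfer(links, steps, d=0.85, seed=0):
    """Estimate PageRank as the share of visits each page receives."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        outlinks = links[current]
        if outlinks and rng.random() < d:
            current = rng.choice(outlinks)   # follow a link
        else:
            current = rng.choice(pages)      # jump to a random page
    return {page: count / steps for page, count in visits.items()}

# Hypothetical graph: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
estimate = random_surfer(links, steps=200_000)
```

More steps give a steadier estimate, which is why the slides compare a short traversal with a much longer one.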

PageRank
Simple example:

Traverse 1,000 times

PR(A): 118 times
PR(B): 109 times
PR(C): 13 times
PR(D): 144 times

PageRank
Simple example:

Traverse 1 million times

PR(A): 13%
PR(B): 10%
PR(C): 2%
PR(D): 15%

[Figure: link graph containing pages A, B, C, and D, annotated with percentages]

Summary
● Currently, a lot of research addresses aggregation, propagation, and recommendation of information from large-scale social networks.
● Automatic event detection and personalization based on user context have become interesting and important.
● In the implementation, the first step is extracting the keywords and their weights from the tweets in the interval.
● Next, the users’ reputation and the terms’ burstiness values are computed.
● The burstiness value of an emerging keyword is influenced by the temporal parameter (the interval and its past intervals).
● Computing reputation works like PageRank: both operate on a directed graph.

Next
Continue

Implementation:

● Selection of Emerging Terms
● Leveraging User Context for Personalization Purposes
● From Emerging Terms to Emerging Topics
● Topic Detection, Labeling, and Ranking

Experiments

● Case and User Studies

References
● Wiki of TF-IDF - https://en.wikipedia.org/wiki/Tf%E2%80%93idf
● Wiki of Markov Chain - https://en.wikipedia.org/wiki/Markov_chain
● Wiki of PageRank - https://en.wikipedia.org/wiki/PageRank
● PageRank how it work - http://goo.gl/bbShFd
● Nine Algorithms That Changed the Future - http://goo.gl/Y9BFmO

