Analyzing Realtime News Raffaele Lorusso – Marco Fusi Milan, November 2015 #RateMe

Upload: marco-fusi

Post on 23-Jan-2018



CONTEXT

This project was carried out during the 2015-2016 master's program “Business Intelligence and Big Data Analytics” at Università di Milano - Bicocca.

#RateMe

IT can achieve a substantial reduction in operating costs by modernizing its data architecture. The innovation includes implementing an Active Archive for cold data, offloading ETL processes, and enriching data.

BIGDATA What are the technologies and the potential of Big Data

TWITTER Twitter as an example of new media and realtime news sharing

NEWSLIFECYCLE How news spreads on Twitter and other new media

[Animated TIMELINE diagram: a news item is followed by a first tweet, then by a growing cascade of tweets, until a new news item appears at the end of the timeline.]


CREATE Twitter is an easy way to create and share news and opinions. It is a new flow of content and information that brings huge opportunities.

ANALYZE With the collected data it is possible to conduct statistical analyses that yield quantitative and qualitative indicators for identifying trends, correlations, flows, sentiment, and more.

FOLLOW Follow how the news evolves over time by analyzing it, placing it in its real-world context, and comparing it with the external events that can help generate and modify the news itself.


ARCHITECTURE Main Components

ARCHITECTURE The Lambda Architecture

[Diagram labels: DATA SOURCES, BATCH LAYER, SPEED LAYER, Machine Learning, PRESENTATION LAYER]
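A minimal sketch of how the layers of a Lambda Architecture can fit together, assuming the data sources deliver tweets as (timestamp, hashtag) events. The batch layer rebuilds a view from the full master dataset, the speed layer keeps incremental counts for events that arrived after the last batch run, and the presentation layer merges the two at query time. The class and method names (`LambdaCounter`, `ingest`, `run_batch`, `query`) are hypothetical and not taken from the project.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture: hashtag counts served from batch + speed views."""

    def __init__(self):
        self.master = []              # append-only log of every event received
        self.batch_view = Counter()   # rebuilt from scratch by the batch layer
        self.speed_view = Counter()   # incremental counts since the last batch run

    def ingest(self, timestamp, hashtag):
        """Data source: append to the master dataset and update the speed view."""
        self.master.append((timestamp, hashtag))
        self.speed_view[hashtag] += 1

    def run_batch(self):
        """Batch layer: recompute the view from the whole master dataset."""
        self.batch_view = Counter(tag for _, tag in self.master)
        self.speed_view.clear()       # those events are now covered by the batch view

    def query(self, hashtag):
        """Presentation layer: merge the batch and speed views."""
        return self.batch_view[hashtag] + self.speed_view[hashtag]


counter = LambdaCounter()
counter.ingest(1, "#hadoop")
counter.ingest(2, "#spark")
counter.run_batch()
counter.ingest(3, "#spark")           # arrives after the batch run
print(counter.query("#spark"))
```

The query stays correct between batch runs because every event is counted in exactly one of the two views.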


Case Study: Big Data Ecosystem on Twitter

Big Data Ecosystem

BIG DATA FRONTEND

BIG DATA BACKEND

Big Data Ecosystem at a glance

[Infographic figures; the labels were lost in extraction: 1 month, 40k, 100k, 28k, 170k, 1.2k, 30k]

Big Data Ecosystem

SENTIMENT ANALYSIS

From the text of a tweet it is possible to compute a measure of the sentiment associated with it. In this project we built two different models.

BIG DATA FRONTEND
CLUSTER THEN PREDICT

BIG DATA BACKEND
DICTIONARY ALGORITHM

SENTIMENT ANALYSIS

BIG DATA BACKEND
DICTIONARY ALGORITHM

This model splits a tweet into tokens, one per word, and then assigns a score to each word by looking it up in a dictionary table that contains positive and negative words with a numerical score.
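The dictionary approach can be sketched as follows. This is an illustration only: the dictionary entries, the regex tokenizer, and the choice to sum word scores into a tweet score are assumptions, since the slides do not say how word scores were aggregated.

```python
import re

# Hypothetical dictionary table: word -> numerical score (positive or negative).
SENTIMENT_DICTIONARY = {
    "good": 1.0,
    "great": 2.0,
    "love": 2.0,
    "bad": -1.0,
    "slow": -1.0,
    "hate": -2.0,
}


def tokenize(tweet):
    """Split a tweet into lowercase word tokens."""
    return re.findall(r"[a-z']+", tweet.lower())


def score_tweet(tweet, dictionary=SENTIMENT_DICTIONARY):
    """Return the summed score of known words, or None if no word is in the dictionary."""
    scores = [dictionary[token] for token in tokenize(tweet) if token in dictionary]
    return sum(scores) if scores else None


tweets = [
    "I love Spark, great performance",
    "Hadoop jobs are slow and bad",
    "Reading about Kafka today",
]
scored = [score_tweet(t) for t in tweets]
# Share of tweets that contain at least one dictionary word.
coverage = sum(s is not None for s in scored) / len(tweets)
print(scored, f"coverage={coverage:.0%}")
```

Tweets with no dictionary word come back as `None`, so the share of scored tweets depends directly on how many words the dictionary contains.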

SENTIMENT ANALYSIS

BIG DATA FRONTEND
CLUSTER THEN PREDICT

This model clusters tweets that contain similar words and then applies a Random Forest algorithm to each cluster.

“Improved Twitter Sentiment prediction through Cluster then Predict Model”, International Journal of Computer Science and Network, August 2015
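A minimal sketch of the cluster-then-predict idea, assuming scikit-learn is available. The training tweets, the labels, the bag-of-words features, the use of k-means for the clustering step, and every parameter value are assumptions made for illustration; the slides only state that tweets are clustered by similar words and that a Random Forest is applied to each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical labeled tweets: 1 = positive, 0 = negative.
train_tweets = [
    "love spark streaming", "spark is great", "spark is slow", "hate spark bugs",
    "love hadoop cluster", "hadoop is great", "hadoop is slow", "hate hadoop setup",
]
train_labels = np.array([1, 1, 0, 0, 1, 1, 0, 0])

# Bag-of-words features, so tweets that share words are close to each other.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_tweets).toarray().astype(float)

# Step 1 (cluster): group tweets with similar words (k-means is an assumed choice).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# Step 2 (predict): fit one Random Forest per cluster.
forests = {}
for cluster_id in np.unique(kmeans.labels_):
    members = kmeans.labels_ == cluster_id
    forest = RandomForestClassifier(n_estimators=25, random_state=0)
    forests[int(cluster_id)] = forest.fit(X_train[members], train_labels[members])


def predict_sentiment(tweets):
    """Route each tweet to its nearest cluster, then use that cluster's forest."""
    X = vectorizer.transform(tweets).toarray().astype(float)
    clusters = kmeans.predict(X)
    return [
        int(forests[int(c)].predict(row.reshape(1, -1))[0])
        for c, row in zip(clusters, X)
    ]


predictions = predict_sentiment(["spark is great", "hate hadoop setup"])
print(predictions)
```

Each forest only sees the tweets of its own cluster, so it can specialize on that cluster's vocabulary.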

DASHBOARD: LIVE DEMO

CONCLUSIONS

• The «Lambda Architecture» seems a good approach because it balances the need for real-time analysis against batch computation

• The Big Data Ecosystem is composed of heterogeneous technologies, each of which solves just a part of the whole problem

• Many technologies are easily interoperable and composable

• There are many first movers in the Big Data market, but also established players that are nowadays a must-have in a Big Data architecture

Big Data Ecosystem - Architecture

CONCLUSIONS

•  The most tweeted technologies are not always the ones that have the largest market share

•  There seems to be no correlation between real Big Data events and tweet volumes

•  In this case study, the sentiment analysis done with the cluster-then-predict model performs worse than the one done with the dictionary algorithm

•  The dictionary algorithm depends heavily on having a good dictionary with many words. With the dictionary we used, only 42% of tweets were scored

•  The analysis of senders and mentioned users showed that many influencers are closely connected to the technologies, or are even the official accounts of those technologies

•  45% of the tweets were sent from the official apps for the Web platform, Android and iOS

Big Data Ecosystem – Data Analysis


Case Study: Data Science seminar @masterBIBDA

Milan, 19 November 2015 #RateMe

Game: Rate this seminar

Players: Our speakers and YOU!

Objectives: Have fun!

#RateMe Rules

Tweet to @masterbibda

Reference the keyword by using a hashtag #datascientistprofiles

Vote alto – medio – basso (high – medium – low)
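A tweet that follows these rules could be parsed roughly as in the sketch below. The function name, the regular expressions, the numeric values given to the three votes, and the decision to ignore the #RateMe hashtag when looking for the keyword are all assumptions for illustration, not the project's actual parser.

```python
import re

# Assumed numeric mapping for the three votes (alto = high, medio = medium, basso = low).
VOTE_SCORES = {"alto": 3, "medio": 2, "basso": 1}


def parse_rating_tweet(text, account="@masterbibda"):
    """Return (keyword, vote) for a tweet that follows the rules, else None."""
    lowered = text.lower()
    if account not in lowered:
        return None                       # the tweet must address the account
    # Keyword hashtags, ignoring the #RateMe game tag itself.
    hashtags = [tag for tag in re.findall(r"#(\w+)", lowered) if tag != "rateme"]
    votes = [word for word in re.findall(r"\w+", lowered) if word in VOTE_SCORES]
    if not hashtags or not votes:
        return None                       # needs both a keyword hashtag and a vote
    return hashtags[0], votes[0]


parsed = parse_rating_tweet("@masterbibda #datascientistprofiles alto #RateMe")
print(parsed, VOTE_SCORES[parsed[1]])
```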

Example #RateMe


and…

Feel free to tweet your thoughts to @masterbibda!

Every Tweet will be analyzed!

DASHBOARD: LIVE DEMO

[TIMELINE diagram: a cascade of tweets ending in a news item]

Enjoy #RateMe


Raffaele Lorusso – Marco Fusi

Milan, November 2015

THANKS!

Analyzing Realtime News
