How to spot first stories on Twitter using Storm Michael Vogiatzis - @mvogiatzis Software Engineer

Upload: michael-vogiatzis

Post on 08-Jul-2015

DESCRIPTION

http://micvog.com/2013/09/08/storm-first-story-detection/

TRANSCRIPT

Page 1: How to Spot First Stories on Twitter using Storm

How to spot first stories on Twitter using Storm

Michael Vogiatzis - @mvogiatzis

Software Engineer

Page 2: How to Spot First Stories on Twitter using Storm

The Task

Find the first document in a stream of documents that discusses a specific event.

Page 3: How to Spot First Stories on Twitter using Storm

Twitter

Spam

◦ It’s Cooooooooooooolddd !! Brrrrrr…

Neutral

◦ #nowplaying ♫ Live At The BBC – Dire Straits

Events

◦ The 6.4-magnitude quake struck just after 9.20pm (CST) on Sunday in the Banda Sea northeast of East Timor.

Page 4: How to Spot First Stories on Twitter using Storm

Algorithm

TF-IDF on input Tweet

Convert it to Vector

Page 5: How to Spot First Stories on Twitter using Storm

TF-IDF

Split text into words

Term Frequency × Inverse Document Frequency

Words that appear in more documents get less weight

Remove stop words and out-of-vocabulary words, e.g. “the”, “lol”

Remove URLs and mentions (@)
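The cleaning and weighting steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the talk's implementation: the stop-word list, the token pattern, the smoothed IDF formula and the function names `tokenize` and `tf_idf` are all assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the slide only names "lol" and "the".
STOP_WORDS = {"the", "lol", "a", "is", "in", "on", "of"}

def tokenize(tweet):
    """Lowercase, drop URLs and @mentions, split into words, drop stop words."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [w for w in re.findall(r"[a-z0-9#']+", text) if w not in STOP_WORDS]

def tf_idf(tokens, doc_freq, num_docs):
    """Return {term: weight}; terms seen in more documents get a lower weight."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {
        # Smoothed IDF (an assumption) so unseen terms do not divide by zero.
        term: (count / total) * math.log((1 + num_docs) / (1 + doc_freq.get(term, 0)))
        for term, count in counts.items()
    }

# Usage: document frequencies come from the tweets seen so far.
docs = [tokenize(t) for t in [
    "Quake in the Banda Sea",
    "Yay Sunday! lol",
    "Sunday quake http://t.co/x @bob",
]]
doc_freq = Counter(term for doc in docs for term in set(doc))
weights = tf_idf(docs[0], doc_freq, len(docs))
print(weights)
```

In this toy corpus "quake" occurs in two tweets and "banda" in one, so "banda" ends up with the larger weight.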

Page 6: How to Spot First Stories on Twitter using Storm

Algorithm

TF-IDF on input Tweet

Convert it to Vector

Find N nearest neighbours

◦ Locality Sensitive Hashing

Page 7: How to Spot First Stories on Twitter using Storm

Locality Sensitive Hashing

Data Clustering – Near neighbour search

Buckets – Hash Tables for similar documents

Random projection creates a hash

Identical hash -> nearest neighbour candidate
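One common way to build such hashes is to take the sign of the vector's dot product with a set of random hyperplanes. The sketch below assumes dense, fixed-length vectors and that scheme; the names `make_hyperplanes`, `lsh_hash` and `Bucket` are hypothetical and not taken from the talk's code.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normals) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return "".join(
        "1" if sum(v * h for v, h in zip(vector, plane)) >= 0 else "0"
        for plane in hyperplanes
    )

class Bucket:
    """One hash table; vectors with identical hashes share an entry."""

    def __init__(self, num_bits, dim, seed):
        self.hyperplanes = make_hyperplanes(num_bits, dim, seed)
        self.table = defaultdict(list)

    def add(self, doc_id, vector):
        """Store the document and return earlier documents with the same hash."""
        key = lsh_hash(vector, self.hyperplanes)
        candidates = list(self.table[key])
        self.table[key].append(doc_id)
        return candidates

# Usage: vectors pointing in the same direction always collide.
bucket = Bucket(num_bits=8, dim=3, seed=42)
print(bucket.add("t1", [1.0, 2.0, 0.5]))
print(bucket.add("t2", [2.0, 4.0, 1.0]))
```

Using several buckets, each with its own hyperplanes, raises the chance that a true near neighbour collides in at least one of them.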

Page 8: How to Spot First Stories on Twitter using Storm

Locality Sensitive Hashing cont’d

Page 9: How to Spot First Stories on Twitter using Storm

Algorithm

TF-IDF on input Tweet

Convert it to Vector

Find N nearest neighbours

◦ Locality Sensitive Hashing

Compare distances and find the closest

If distance < threshold: not a first story
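The slides do not name the distance metric. Assuming cosine distance over sparse term-weight vectors, picking the closest candidate could look like this; `cosine_distance` and `closest` are hypothetical helper names.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)

def closest(query, candidates):
    """Return (distance, doc_id) of the nearest candidate, or (1.0, None)."""
    return min(
        ((cosine_distance(query, vec), doc_id) for doc_id, vec in candidates.items()),
        default=(1.0, None),
    )

# Usage: candidates would be the tweets that share a hash with the query.
query = {"quake": 1.0, "sea": 1.0}
candidates = {"t1": {"quake": 1.0, "sea": 1.0}, "t2": {"sunday": 1.0}}
print(closest(query, candidates))
```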

Page 10: How to Spot First Stories on Twitter using Storm

Extra Step

If the closest tweet found in the buckets is not close enough

Compare with a fixed number of recent tweets

Check again

Page 11: How to Spot First Stories on Twitter using Storm

Algorithm

TF-IDF on input Tweet

Convert it to Vector

Find N nearest neighbours

◦ Locality Sensitive Hashing

Compare distances and find the closest

If distance < threshold: not a first story

Else compare with X most recent tweets (optimization)

If new_distance > threshold -> first story!
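The two-stage decision at the end of the algorithm can be sketched as below. The threshold value, the buffer size, the Jaccard stand-in distance and the names `is_first_story` and `toy_distance` are illustrative assumptions, since the slides give none of them.

```python
from collections import deque

THRESHOLD = 0.5   # hypothetical value; the slides do not give one
RECENT_SIZE = 3   # hypothetical "X most recent tweets"

def is_first_story(vector, bucket_candidates, recent, distance, threshold=THRESHOLD):
    """Apply the two-stage check from the slide to one tweet vector."""
    nearest = min((distance(vector, c) for c in bucket_candidates), default=float("inf"))
    if nearest < threshold:
        return False  # a close neighbour exists in the buckets
    # Fallback: LSH may have missed a neighbour, so scan the recent tweets too.
    new_distance = min((distance(vector, r) for r in recent), default=float("inf"))
    return new_distance > threshold

def toy_distance(a, b):
    """Stand-in distance on word sets (Jaccard); swap in the real metric."""
    return 1.0 - len(a & b) / len(a | b)

recent = deque(maxlen=RECENT_SIZE)  # oldest tweets fall off automatically
stream = [{"quake", "banda", "sea"}, {"yay", "sunday"}, {"quake", "banda", "sea", "timor"}]
for words in stream:
    # No bucket candidates here, to exercise the recent-tweets fallback.
    print(sorted(words), is_first_story(words, [], recent, toy_distance))
    recent.append(words)
```

A bounded `deque` keeps the fallback cost constant per tweet, which is the point of comparing against only a fixed number of recent tweets.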

Page 12: How to Spot First Stories on Twitter using Storm

Storm

Real-time computation made easy

Page 13: How to Spot First Stories on Twitter using Storm

Storm

Distributed real-time computation system

Fault tolerant

Fast

Scalable

Guaranteed message processing

Open source

Multilang capabilities

Page 14: How to Spot First Stories on Twitter using Storm

Elements

Streams

◦ Set of tuples

◦ Unbounded sequence of data

Spout

◦ Source of streams

Bolts

◦ Application logic

◦ Functions

◦ Streaming aggregations, joins, DB ops
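To make the roles of these elements concrete, here is a plain-Python analogy built on generators. It does not use Storm's API; all names (`tweet_spout`, `lowercase_bolt`, `count_bolt`) are hypothetical, and a real topology would be distributed across workers rather than run in one process.

```python
def tweet_spout(tweets):
    """Spout analogue: the source that emits tuples into the stream."""
    for text in tweets:
        yield {"text": text}

def lowercase_bolt(stream):
    """Bolt analogue applying a function to every tuple."""
    for tup in stream:
        yield {**tup, "text": tup["text"].lower()}

def count_bolt(stream):
    """Bolt analogue doing a streaming aggregation (a running count)."""
    seen = 0
    for tup in stream:
        seen += 1
        yield {**tup, "count": seen}

# Wiring the pieces together plays the role of a topology.
topology = count_bolt(lowercase_bolt(tweet_spout(["Yay Sunday!", "Quake in Banda Sea"])))
for tup in topology:
    print(tup)
```

Each stage consumes tuples one at a time and never needs the whole stream, which mirrors the "unbounded sequence" idea above.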

Page 15: How to Spot First Stories on Twitter using Storm

Topology

Page 16: How to Spot First Stories on Twitter using Storm

Part I

Page 17: How to Spot First Stories on Twitter using Storm

Part II

Page 18: How to Spot First Stories on Twitter using Storm

Results

Input tweet: @Real_Liam_Payne i wanna be your female pal
Stored tweet: i. wanna be your best friend so follow me
Similarity score: 0.385

Input tweet: RT @damnitstrue: Life is for living, not for stressing.
Stored tweet: RT Life is for living, not for stressing.
Similarity score: 0.99

Input tweet: The 6.4-magnitude quake struck just after 9.20pm (CST) on Sunday in the Banda Sea northeast of East Timor. http://t.co/UhfwCS2xPp
Stored tweet: Yay Sunday!
Similarity score: 0.129

Page 19: How to Spot First Stories on Twitter using Storm

Evaluation

Evaluation on speed-up metric

◦ 1381% vs single-threaded

◦ 372% vs multi-threaded (4 threads)

Having humans label tweets is hard!

Implementation tested on newswire and broadcast news

False alarms

Page 20: How to Spot First Stories on Twitter using Storm

Future work

Reduce false alarms by using threads for topics

Image similarity detection

Audio similarity?

◦ Hello Shazam!

Page 21: How to Spot First Stories on Twitter using Storm

Michael Vogiatzis

Twitter: @mvogiatzis

Code on Github

http://micvog.com

◦ Next post: “7 Lessons Learned at a London startup”
