zen & the art of data mining

Post on 23-Aug-2014

1.664 Views

Category:

Science

10 Downloads

Preview:

Click to see full reader

DESCRIPTION

A talk I gave at Old Dominion University to new students from PES University in Bangalore

TRANSCRIPT

1

Old Dominion UniversityDepartment of Computer Science

Hany SalahEldeen

Hany SalahEldeen Khalil hany@cs.odu.edu

Zen & the Art of Data Mining

07-08-14

Social Media Data Collection and the path to Modeling & Predicting User Intention

Web Science & Digital Libraries Lab

2

Before we start..here is a lil bit about me…

Hany SalahEldeen

3

Hany SalahELdeen

Education:• PhD Candidate• Web Science and Digital Libraries Group

• Masters Degree in Computer Vision and Artificial Intelligence• Universitat Autonoma de Barcelona

• Bachelors of Computer Systems Engineering• University of Alexandria

Hany SalahEldeen

4

Research & Technical Experience• Microsoft Research Cairo• Google GmBH Zurich • Microsoft Inc. Mountain View• National University of Singapore

Hany SalahEldeen

5Hany SalahEldeen

Detecting, Modeling, & Predicting User Temporal Intention

in Social Media

Web Mining Pattern Analysis Machine Learning

Human Behavioral AnalysisSocial Media Analysis

So what am I investigating?

6

Publications

Hany SalahEldeen

Shanghai CIKM 2014 Conference- 1 first author paper- 1 second author paper

London DL 2014 Conference- 1 third author paper

Malta TPDL 2013 Conference- 1 first author paper

7

Publications

Hany SalahEldeen

Indianapolis JCDL 2013 Conference- 1 first author paper

Rio de Janeiro WWW 2013 Conference- 1 first author paper

Cyprus TPDL 2012 Conference- 1 first author paper

8

Beside the perks of travelling, our research has been popular…

Hany SalahEldeen

9

MIT Technology Review

Hany SalahEldeen

10

MIT Technology Review

Hany SalahEldeen

11

MIT Technology Review

Hany SalahEldeen

12

Mashable

Hany SalahEldeen

13

Popular Mechanics

Hany SalahEldeen

14

BBC

Hany SalahEldeen

15

The Virginian Pilot

Hany SalahEldeen

16

Our Research’s Popularity

Hany SalahEldeen

• Local newspaper: The Virginia Pilot• 4 x MIT Technology Review• BBC• Mashable• The Atlantic• Yahoo News• Articles in > 11 different languages

• We have been called:• The Internet Archeologists• Web Time Travelers

17

My goal:Detect, model, and predict

user intention in social media

Hany SalahEldeen

18

Ok hold on, let’s go back to the basics…

Hany SalahEldeen

19

Web 2.0Definition: Web 2.0 is a concept that takes the network as a platform for information sharing, interoperability, user-centered design, and collaboration on the World Wide Web.*

* http://en.wikipedia.org/wiki/Web_2.0Hany SalahEldeen

20

Web 2.0• Yes, Web 2.0 is about “user-generated

content”• But explicit content contributed by

users is just 20% of what “matters”• 80% is in the implicitly contributed

data*

Hany SalahEldeen

*Toby Segaran, Programming Collective Intelligence, 2007

21

Systems & Web 2.0• Google: Utilizes PageRank which is a

technique for extracting intelligence from the link structure

• Flickr: Utilizes “interestingness” algorithm• Amazon: Utilizes “people who bought this

product also bought” feature• Pandora: Utilizes “similar artist radio”• eBay: Utilizes “reputation system”

Hany SalahEldeen

22

So why do we even care about all that?

Hany SalahEldeen

23

Power to the People!

Hany SalahEldeen

24

Power to the People!• Because analyzing a huge dataset of

millions of users will yield a lot of potential insights into: • User Experience• Marketing• Personal Taste• Human Behavior in general.

Hany SalahEldeen

25

So what is Data Mining?

Hany SalahEldeen

26

Data Mining• Definition: It is the computational process of

discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

http://en.wikipedia.org/wiki/Data_mining

Hany SalahEldeen

27

Back to my goal:

Hany SalahEldeen

Detecting, Modeling, & Predicting User Temporal Intention

in Social Media

28

Let’s breakdown the title first…

Hany SalahEldeen

Detecting, Modeling, & Predicting User Temporal Intention

in Social Media

29

Let’s breakdown the title first…

Hany SalahEldeen

Detecting, Modeling, & Predicting User Temporal Intention

in Social Media

30

Scenario 1:Jenny reading Jeff’s tweets

Hany SalahEldeen

31

Michael Jackson Dies

Hany SalahEldeen

Snapshot on: June 25th 2009http://web.archive.org/web/20090625232522/http://www.cnn.com/

32

Jeff tweets about it…

Hany SalahEldeen

Published on: June 25th 2009https://twitter.com/mdnitehk/status/2333993907

33

Jeff’s friend Jenny was on a vacation in Hawaii for a month

Jenny is off the grid…

Hany SalahEldeen

34

When she came back she checked Jeff’s tweets and was shocked!

Jenny starts catching up a month later

Hany SalahEldeen

Read on: July26th 2009!https://twitter.com/mdnitehk/status/2333993907

35

She quickly clicked on the link in the tweet…

Jenny follows the link on July 26th

Hany SalahEldeenhttp://web.archive.org/web/20090726234411/http://www.cnn.com/

CNN page on: July 26th 2009

36

• Implication:• Jenny thought Jeff is making a joke about her

favorite singer and she got mad at him

• Problem:• The tweet and the resource the tweet links

to have become unsynchronized.

Jenny is confused!

Hany SalahEldeen

37

Scenario 2:The Egyptian Revolution

Hany SalahEldeen

38

The Egyptian Revolution Jan 2011

Hany SalahEldeen

39

Reading about it in Storify.com a year later in March 2012

Hany SalahEldeen

http://storify.com/maq4sure/egypts-revolution

40

I noticed some shared images are missing

Hany SalahEldeenhttp://storify.com/maq4sure/egypts-revolution

41

Some tweets are still intact

Hany SalahEldeen

https://twitter.com/miss_amy_qb/status/32477898581483521

42

…and some lost their meaning with the disappearance of the images

Hany SalahEldeen

Missing ?https://twitter.com/aishes/status/32485352102952960

https://twitter.com/omar_chaaban/status/32203697597452289

43

The tweet remains but the shared image disappeared…

Hany SalahEldeen

http://yfrog.com/h5923xrvbqqvgzj

44

• Implication:• The reader cannot understand what the

author of the tweet meant because the image is not available.

• Problem:• The post is available but the linked resource

(image) is completely missing.

Cairo….we have a problem!

Hany SalahEldeen

45

…back to the title

Hany SalahEldeen

Detecting, Modeling, & Predicting User Temporal Intention

in Social Media

46

…back to the title

Hany SalahEldeen

Detecting, Modeling, & Predicting User Temporal Intention

in Social Media

4747

The Anatomy of a Tweet

Hany SalahEldeen

4848

The Anatomy of a TweetAuthor’s username

Other user mention

Tweet Body

Hash TagShortened URL to resource

Publishing timestamp

SocialPost

Shared Resource

Interactionoptions

Hany SalahEldeen

4949

3 URIs = 3 Chances to fail

Hany SalahEldeen

http://news.blogs.cnn.com/2012/04/26/norwegians-sing-to-annoy-mass-killer/

https://twitter.com/KentEiler/status/195535749754527745

5050

…t1

t4

t2

t3 t5t7 t8 t9 tn

t6

Explanation in MJ’s example

5151

If I click on a link in a tweet, which version should I get?

ttweet or tclick ?

Hany SalahEldeen

5252

Sometimes you want a previous version

The Correct Temporal Intention

CNN.com at the closest time to the tweet: 25th June 2009 ~ 7pmHany SalahEldeen

5353

Sometimes you want the current version

The Correct Temporal Intention

In this case the current state of the press releases pageHany SalahEldeen

5454

Research Question

Can we estimate the users’ intention at the time of posting

and reading to predict and maintain temporal consistency?

Hany SalahEldeen

5555

People rely on social media for most updated information

Hany SalahEldeen

56Hany SalahEldeen

So if you are posting a tweet about your cat…

…No one cares!

57Hany SalahEldeen

Regardless how cool your cat was!

58

All tweets are equal…

…but some are more equal than the others

Hany SalahEldeen

59

Preliminary Research Questions:

1. How long would these last?2. And if lost, are they archived?3. Is this what the author intended?

Hany SalahEldeen

6060

Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.

Hany SalahEldeen

Historical Integrity

6161

The life cycle of a social post

Hany SalahEldeen

6262

The life cycle of a social post

tweets

Hany SalahEldeen

6363

The life cycle of a social post

tweets Links to

Hany SalahEldeen

6464

The life cycle of a social post

tweets

What the reader

receives

Links to

Same state the author intended

Hany SalahEldeen

6565

The life cycle of a social post

tweets

What the reader

receives

Links to

Same state the author intended

Hany SalahEldeen

The resource has disappeared

6666

The life cycle of a social post

tweets

What the reader

receives

Links to

Same state the author intended

The resource has disappeared

The resource has changed

Hany SalahEldeen

6767

Same state the author intended

The Resource’s Possibilities

a bigger problem since the reader might not know.

What the reader

receives

The resource has disappeared

The resource has changed

Hany SalahEldeen

6868

We could lose the linked resource

Hany SalahEldeen

6969

The attack on the embassy was in February 2013

Or the resource could change

Hany SalahEldeen

7070

Why do we want to detect the Author’s Temporal Intention?

• Match: and convey the intended information.• Notify:– the author that the resource is prone to change.– the reader that the resource has changed.

• Preserve: the resource by pushing snapshots into the archive automatically.

• Retrieve: the closest archived version to maintain the consistency.

Hany SalahEldeen

7171

Our investigation angles

1. The state of the archived content2. The age of the shared resource 3. The states of the resource:

1. Missing from the live web2. Changed from what the author intended to share

4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted

intention

Hany SalahEldeen

7272

Our investigation angles

1. The state of the archived content2. The age of the shared resource 3. The states of the resource:

1. Missing from the live web2. Changed from what the author intended to share

4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted

intention

Hany SalahEldeen

7373

Estimating Web Archiving Coverage• Goal: Estimate how much of the public web is present in the public archives and

how many copies are available?• Action:

– Getting 4 different datasets from 4 different sources:• Search Engines Indices• Bit.ly• DMOZ• Delicious.

• Results: *

• Publications: – How much of the web is archived? JCDL '11– http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.htmlHany SalahEldeen

16%-79% Archived according to the source

7474

Our investigation angles

1. The state of the archived content2. The age of the shared resource 3. The states of the resource:

1. Missing from the live web2. Changed from what the author intended to share

4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted

intention

Hany SalahEldeen

7575

The timeline of the resource

Hany SalahEldeen

http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html

7676

Timestamps Accumulation

Hany SalahEldeen

7777

Actual Vs. Estimated Dates

Hany SalahEldeen

• Successfully estimated the creation date >75% of the resources

• >33% we estimated the exact date

7878

Our investigation angles

1. The state of the archived content2. The age of the shared resource 3. The states of the resource:

1. Missing from the live web2. Changed from what the author intended to share

4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted

intention

Hany SalahEldeen

79

• From Twitter, Websites, Books:• The Egyptian revolution

• From Twitter Only:• Stanford’s SNAP dataset:• Iranian elections• H1N1 virus outbreak• Michael Jackson’s death• Obama’s Nobel Peace Prize

• Twitter API:• The Syrian uprising

Six Socially Significant Events

Hany SalahEldeen

80

Resources Missing & Archived

Hany SalahEldeen

81

Revisiting after a year…

Hany SalahEldeen

• There is a nearly linear relationship between the amount missing from the web and time.

• After 1 year ~11% is gone, and 0.02% is lost every day

82

Measured Vs. Predicted

Hany SalahEldeen

83

First Attempts to Shared Content Replacement

Hany SalahEldeen

• We performed an experiment to gauge how many of the resources that are missing could be replaced with other similar resources.

• Collected a dataset with available resources which we assumed to be missing

• Used our method to extract the replacement resources

• Measured the similarity with the original resource

84

First Attempts to Shared Content Replacement

Hany SalahEldeen

We were able to extract another resource with >70% similarity to the missing resource in >40% of the cases

8585

Our investigation angles

1. The state of the archived content2. The age of the shared resource 3. The states of the resource:

1. Missing from the live web2. Changed from what the author intended to share

4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted

intention

Hany SalahEldeen

8686

Temporal Intention Relevancy Model(TIRM)

Between ttweet and tclick:

The linked resource could have:• Changed• Not changed

The tweet and the linked resource could be:• Still relevant• No longer relevant

Hany SalahEldeen

8787

Resource is changed but relevant

• The resource changed• But it is still relevant

Intention: need the current version of the resource at any time

Hany SalahEldeen

8888

Relevancy and Intention Mapping

Current

Hany SalahEldeen

8989

Resource is changed and not relevant

Intention: need the past version of the resource at any time

• The resource changed• But it is no longer relevant

Hany SalahEldeen

9090

Past

Relevancy and Intention Mapping

Current

Hany SalahEldeen

9191

Resource is not changed and relevant

Intention: need the past version of the resource at any time

• The resource is not changed• And it is relevant

Hany SalahEldeen

9292

Past

Relevancy and Intention Mapping

Current

Past

Hany SalahEldeen

9393

Resource is not changed and not relevant

Intention: I am not sure which version of the resource I need

• The resource is not changed• But it is not relevant

Hany SalahEldeen

9494

Past

Relevancy and Intention Mapping

Current

Past Not Sure

Hany SalahEldeen

9595

Our investigation angles

1. The state of the archived content2. The age of the shared resource 3. The states of the resource:

1. Missing from the live web2. Changed from what the author intended to share

4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted

intention

Hany SalahEldeen

9696

Feature extraction

• For each tweet we perform:– Link analysis– Social Media Mining– Archival Existence– Sentiment Analysis– Content Similarity– Entity Identification

Hany SalahEldeen

9797

1- Link analysis

• Since the tweets have embedded resources shortened by Bit.ly we can extract:– Total number of clicks– Hourly click logs– Creation dates– Referring websites– Referring countries

• We calculate the depth of the resource in relation to its domain (either it is a leaf node or a root page)– We calculated the number of backslashes in the resource’s URI

Hany SalahEldeen

9898

2- Social Media Mining

• Twitter:– Using Topsy.com’s API to

extract:• Total number of tweets.• The most recent 500.• Number of tweets by

influential users.

The collection of tweets extracted provided an extended context of the resource authored by users in the twittersphere.

Hany SalahEldeen

9999

2- Social Media Mining• Facebook:– Mined too for likes, shares, posts, and clicks related to each

resource.

Hany SalahEldeen

100100

3- Archival Existence• Using Memento Time

Maps we get:– Total mementos

available– Different archives count.– The closest archived

version to the tweet time.

Hany SalahEldeen

101101

4- Sentiment Analysis• Using NLTK libraries of natural language text processing• Extract the most prominent sentiment in the text

Hany SalahEldeen

102102

5- Content Similarity• Steps:– We download the content HTML using Lynx browser.– We apply boilerplate removal algorithm and full text extraction.– Calculate the cosine similarity between the two pages.

70% similarity

Hany SalahEldeen

103103

6- Entity Identification• By visual inspection we observed that the majority of tweets about

celebrities are related to current events.• We harvested Wikipedia for lists of actors, politicians, and athletes.• Checked the existence of a celebrity mention in the tweets.

Actor: Johnny Depp

Hany SalahEldeen

104104

The trained classifier

• From the feature extraction phase we extracted 39 different features to train the classifier.

• Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32%

Hany SalahEldeen

105105

What’s Next for Hany?

• Finish up my dissertation• Defend.• Get a research/Data scientist position• Interests:– L3S Research Center Germany– Microsoft Research

Hany SalahEldeen

106106

1. The state of the archived content2. The age of the shared resource 3. The states of the resource:

1. Missing from the live web2. Changed from what the author intended to share

4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted

intention

Hany SalahEldeen

Summary:

Email: hany@cs.odu.eduOffice: 3102Website: http://www.cs.odu.edu/~hany/Twitter: @hanysalaheldeen

top related