Modeling Temporal Intention in Resource Sharing
Hany M. SalahEldeen & Michael L. Nelson
Old Dominion University
Department of Computer Science
Web Science and Digital Libraries Lab
WADL 2013
All tweets are equal…
…but some are more equal than the others
Preliminary research questions:
1. How long would these last?
2. And if lost, is there backup somewhere?
3. Is this what the author intended?
Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.
Historical integrity
People rely on social media for the most up-to-date information
The life cycle of a social post
• The author tweets a post that links to a resource.
• What the reader receives is one of:
– The same state the author intended
– The resource has disappeared
– The resource has changed
Resource’s possibilities
What the reader receives:
• Same state the author intended
• The resource has disappeared
• The resource has changed: a bigger problem since the reader might not know.
We could lose the linked resource
The attack on the embassy was in February 2013
Or the resource could change
Why do we want to detect the Author’s Temporal Intention?
• Match: and convey the intended information.
• Notify:
– the author that the resource is prone to change.
– the reader that the resource has changed.
• Preserve: the resource by pushing snapshots into the archive automatically.
• Retrieve: the closest archived version to maintain consistency.
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
   1. Missing from the live web
   2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the predicted intention
Estimating web archiving coverage
• Goal: Estimate how much of the public web is present in the public archives and how many copies are available.
• Action:
– Getting 4 different datasets from 4 different sources:
  • Search Engine Indices
  • Bit.ly
  • DMOZ
  • Delicious
• Results: *
• Publications:
– How much of the web is archived? JCDL '11
* Table Courtesy of Ahmed AlSum JCDL 2011
The timeline of the resource
Timestamps accumulation
Six socially significant events
• From Twitter, Websites, Books:
– The Egyptian revolution.
• From Twitter Only:
– Stanford’s SNAP dataset:
  • Iranian elections.
  • H1N1 virus outbreak.
  • Michael Jackson’s death.
  • Obama’s Nobel Peace Prize.
– Twitter API:
  • The Syrian uprising.
Resources missing & archived
Revisiting after a year…
Measured vs. predicted
Interesting phenomenon: reappearance on the live web and disappearance from the archives
Reappearance and disappearance predictions
Temporal Intention Relevancy Model (TIRM)
Between t_tweet and t_click:
The linked resource could have:
• Changed
• Not changed
The tweet and the linked resource could be:
• Still relevant
• No longer relevant
Resource is changed but relevant
• The resource changed
• But it is still relevant
Intention: need the current version of the resource at any time
Resource is changed and not relevant
• The resource changed
• But it is no longer relevant
Intention: need the past version of the resource at any time
Resource is not changed and relevant
• The resource is not changed
• And it is relevant
Intention: need the past version of the resource at any time
Resource is not changed and not relevant
• The resource is not changed
• But it is not relevant
Intention: I am not sure which version of the resource I need
Relevancy and intention mapping
              Relevant   Not relevant
Changed       Current    Past
Not changed   Past       Not Sure
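The four cases above amount to a lookup from (changed, relevant) to the version the author most likely intended. A minimal sketch of that lookup follows; the names `Intention`, `TIRM_MAPPING`, and `intended_version` are hypothetical and do not come from the talk.

```python
from enum import Enum


class Intention(Enum):
    CURRENT = "current"
    PAST = "past"
    NOT_SURE = "not sure"


# Keys are (resource_changed, still_relevant), following the four cases above.
TIRM_MAPPING = {
    (True, True): Intention.CURRENT,
    (True, False): Intention.PAST,
    (False, True): Intention.PAST,
    (False, False): Intention.NOT_SURE,
}


def intended_version(resource_changed: bool, still_relevant: bool) -> Intention:
    """Look up the intended version for one tweet/resource pair."""
    return TIRM_MAPPING[(resource_changed, still_relevant)]
```

How "changed" and "relevant" are decided is left out here; in the talk, relevancy is predicted by the trained classifier.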
Feature extraction
• For each tweet we perform:
– Link analysis
– Social Media Mining
– Archival Existence
– Sentiment Analysis
– Content Similarity
– Entity Identification
Modeling and classification using Mechanical Turk
Relevant Assignments                           929   82.65%
Non-Relevant Assignments                       195   17.35%

5 MT workers agreeing (5-0 split)              589   52.40%
4 MT workers agreeing (4-1 split)              309   27.49%
3 MT workers agreeing (3-2 close call split)   226   20.11%

• To remove confusion we removed the close calls: 898 instances remaining
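The close-call filter can be sketched as a majority vote with a minimum agreement threshold. This is an illustration only: the function name `majority_label` and the threshold parameter are assumptions, while the counts are the ones reported on the slide.

```python
from collections import Counter


def majority_label(votes, min_agreement=4):
    """Return the majority label, or None when fewer than min_agreement workers agree."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


# Agreement counts reported above: number of agreeing workers -> instances.
split_counts = {5: 589, 4: 309, 3: 226}
total = sum(split_counts.values())  # all labeled instances
kept = sum(n for agree, n in split_counts.items() if agree >= 4)  # 5-0 and 4-1 splits
```

Dropping the 226 close calls (3-2 splits) from the 1124 labeled instances leaves the 898 instances used for training.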
The trained classifier
• From the feature extraction phase we extracted 39 different features to train the classifier.
• Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32%
Testing the model
10-Fold Cross-Validation Testing

Classifier: Cost sensitive classifier based on Random Forest
Mean Absolute Error        0.15
Root Mean Squared Error    0.27
Kappa Statistic            0.39
Incorrectly Classified %   9.68%
Correctly Classified %     90.32%

Class              Precision   Recall   F-measure
Relevant           0.93        0.96     0.95
Non-Relevant       0.53        0.37     0.44
Weighted Average   0.89        0.90     0.90
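As a check on the per-class figures, the F-measure can be recomputed from precision and recall. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall); the slide does not say which variant was reported.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values taken from the table above.
relevant_f = f_measure(0.93, 0.96)
non_relevant_f = f_measure(0.53, 0.37)
```

From the rounded inputs this gives about 0.94 for the Relevant class and about 0.44 for the Non-Relevant class. The small gap from the reported 0.95 is presumably because the table was computed from unrounded values.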
TimeLord Navigator
Thanks!
Hany SalahEldeen
Web Science & Digital Libraries, Old Dominion University
Email: [email protected]
@hanysalaheldeen
TimeLord Navigator
Demo:
www.cnn.com
www.bbc.com
Evaluation
Actual Vs. Estimated Dates
Resources Missing & Archived
Collection        Percentage Missing   Percentage Archived
H1N1 Outbreak     23.49%               41.65%
Michael Jackson   36.24%               39.45%
Iran              26.98%               43.08%
Obama             24.59%               47.87%
Egypt             10.48%               20.18%
Syria             7.04%                5.35%
(unlabeled)       31.62%               30.78%
(unlabeled)       24.47%               36.26%
(unlabeled)       25.64%               43.87%
(unlabeled)       26.15%               46.15%
First Attempts at Shared Content Replacement
Link analysis
• Since the tweets have embedded resources shortened by Bit.ly we can extract:
– Total number of clicks
– Hourly click logs
– Creation dates
– Referring websites
– Referring countries
• We calculate the depth of the resource in relation to its domain (whether it is a leaf node or a root page):
– We count the number of slashes in the resource’s URI
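The depth feature can be sketched as follows. Instead of counting raw slashes, this variant counts the non-empty path segments of the URI, a closely related measure chosen here so that a trailing slash does not change the result. The function name `uri_depth` is hypothetical.

```python
from urllib.parse import urlparse


def uri_depth(uri: str) -> int:
    """Count the non-empty path segments of a URI; 0 means a root page."""
    path = urlparse(uri).path
    return len([segment for segment in path.split("/") if segment])
```

For example, `uri_depth("http://www.cnn.com/")` is 0 (a root page), while a URI such as `http://www.example.com/2013/02/story.html` has depth 3.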
Social Media Mining
• Twitter:
– Using Topsy.com’s API to extract:
  • Total number of tweets.
  • The most recent 500.
  • Number of tweets by influential users.
The collection of tweets extracted provided an extended context of the resource authored by users in the twittersphere.
Social Media Mining
• Facebook:
– Also mined for likes, shares, posts, and clicks related to each resource.
Archival Existence
• Using Memento TimeMaps we get:
– Total mementos available
– Different archives count
– The closest archived version to the tweet time
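Picking the closest archived version can be sketched as a nearest-neighbor search over memento datetimes. The sketch assumes the datetimes have already been parsed from a TimeMap; fetching and parsing are omitted, and the function name `closest_memento` is hypothetical.

```python
from datetime import datetime


def closest_memento(memento_times, tweet_time):
    """Return the memento datetime nearest to tweet_time, or None if there are none."""
    if not memento_times:
        return None
    return min(memento_times, key=lambda t: abs(t - tweet_time))


# Illustrative values only, not data from the study.
example_mementos = [
    datetime(2012, 12, 1),
    datetime(2013, 2, 10),
    datetime(2013, 6, 30),
]
example_closest = closest_memento(example_mementos, datetime(2013, 2, 14))
```

Here the memento from 10 February 2013 is selected, since it is four days from the example tweet time.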
Sentiment Analysis
• Using the NLTK libraries for natural language text processing
• Extract the most prominent sentiment in the text
Content Similarity
• Steps:
– We download the content HTML using the Lynx browser.
– We apply a boilerplate removal algorithm and full text extraction.
– We calculate the cosine similarity between the two pages.
70% similarity
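The last step can be sketched with raw term-frequency vectors. This is a minimal illustration that assumes the two pages have already been reduced to plain text; the tokenization and the absence of term weighting are assumptions, not details given in the talk.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts using raw term-frequency vectors."""
    vec_a = Counter(re.findall(r"\w+", text_a.lower()))
    vec_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A result near 1.0 means the two versions share almost all of their vocabulary in similar proportions; a result of 0.0 means they share no terms.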
Entity Identification
• By visual inspection we observed that the majority of tweets about celebrities are related to current events.
• We harvested Wikipedia for lists of actors, politicians, and athletes.
• We checked for the existence of a celebrity mention in the tweets.
Actor: Johnny Depp
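The mention check can be sketched as a case-insensitive whole-phrase match against a list of names. The name list is assumed to have been harvested beforehand, and the function name `mentions_celebrity` is hypothetical.

```python
import re


def mentions_celebrity(tweet: str, celebrity_names) -> bool:
    """True if any name appears in the tweet as a whole phrase, ignoring case."""
    for name in celebrity_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, tweet, flags=re.IGNORECASE):
            return True
    return False
```

For instance, a tweet containing "johnny depp" matches the entry "Johnny Depp", while a tweet with no listed name does not.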