![Page 1: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/1.jpg)
Web Archive Content Analysis: Disaster Events Case Study
IIPC 2015 General Assembly Stanford University and Internet Archive
Mohamed Farag Dr. Edward A. Fox
DLRL, CS @ Virginia Tech
April 27 – May 1, 2015
![Page 2: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/2.jpg)
Acknowledgments
• Related Funding:
  – 2007-2008: NSF IIS-0736055, DL-VT416: A Digital Library Testbed for Research Related to 4/16/2007 at Virginia Tech
  – 2009-2013: NSF IIS-0916733, Crisis, Tragedy, and Recovery network (CTRnet)
  – 2013-2016: NSF IIS-1319578, Integrated Digital Event Archive & Library (IDEAL)
• The Internet Archive (Kristine Hanna, co-PI):
  – Heritrix crawler and other tools and support
  – Hosting the crawls and resulting archives
• IDEAL team also includes Drs. Kavanaugh, Sheetz, and Shoemaker; and GRA Sunshin Lee
![Page 3: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/3.jpg)
Outline
• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future work
![Page 4: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/4.jpg)
Building archives for events - 1 Manual Curation
• We have created ~ 60 collections ( https://archive-it.org/organizations/156 )
• These collections are about disaster events: bombings, earthquakes, hurricanes, plane crashes, shootings, floods, fires
• Manual preparation of URLs and archiving using Archive-it service
![Page 5: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/5.jpg)
Sample Web Collections

| Collection Name | No. of Seeds |
| --- | --- |
| Alabama University Shooting | 116 |
| April 16 Archive | 88 |
| Chile Earthquake | 19 |
| Nevada air race crash | 64 |
| China Floods | 60 |
| Encephalitis (India) | 59 |
| Hurricane Irene | 70 |
![Page 6: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/6.jpg)
Building archives for events - 2 Seeds from social media (Twitter)
• We created more than 600 tweet collections with ~ 1 billion tweets
• For each collection we extract URLs in the tweets, fetch webpages, and archive just those webpages
• Webpage collections are of two types:
  – Disaster events: shootings, earthquakes, plane crashes, hurricanes, bombings, terrorism, floods, fire
  – Community and political events
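The step of extracting URLs from tweets and keeping only the distinct target webpages can be sketched as follows. This is a minimal illustration, not the project's actual code: the regular expression, the sample tweets, and the `redirects` lookup table are all assumptions, and a real resolver would follow HTTP redirects instead of consulting a dictionary.

```python
import re
from typing import Callable, Dict, Iterable, List

# Simple pattern for http(s) links in tweet text (an approximation, not a full URL grammar).
URL_PATTERN = re.compile(r"https?://[^\s]+")


def extract_urls(tweets: Iterable[str]) -> List[str]:
    """Return the distinct URLs found in the tweets, in first-seen order."""
    seen: Dict[str, None] = {}
    for text in tweets:
        for url in URL_PATTERN.findall(text):
            # Strip trailing punctuation that belongs to the sentence, not the link.
            seen.setdefault(url.rstrip(".,;:!?)\"'"), None)
    return list(seen)


def expand_urls(urls: Iterable[str], resolve: Callable[[str], str]) -> List[str]:
    """Map each (possibly shortened) URL to its target and drop duplicate targets."""
    targets: Dict[str, None] = {}
    for url in urls:
        targets.setdefault(resolve(url), None)
    return list(targets)


# Hypothetical sample data standing in for a tweet collection.
sample_tweets = [
    "Flooding update http://t.co/abc123 #example",
    "Same story here: http://t.co/xyz789.",
    "RT Flooding update http://t.co/abc123",
]
# Hypothetical redirect table; a real resolver would issue HTTP requests.
redirects = {
    "http://t.co/abc123": "http://news.example.org/flood",
    "http://t.co/xyz789": "http://news.example.org/flood",
}

short_urls = extract_urls(sample_tweets)
pages_to_archive = expand_urls(short_urls, lambda u: redirects.get(u, u))
print(short_urls)
print(pages_to_archive)
```

Passing the resolver in as a function keeps the deduplication logic testable without network access.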
![Page 7: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/7.jpg)
Sample Tweet Collections

| Collection | Keywords/Hashtags | No. of Tweets | Start date |
| --- | --- | --- | --- |
| Hurricane Sandy | hurricane sandy | 3,219,383 | 2012-10-26 |
| Ebola | #ebola | 1,855,680 | 2014-07-30 |
| Ferguson shooting | #Ferguson | 1,580,479 | 2014-08-11 |
| Thanksgiving | #Thanksgiving | 214,888 | 2014-11-20 |
| AirAsia Plane Crash | #QZ8501 | 174,353 | 2014-12-30 |
| Charlie Hebdo shooting | #CharlieHebdo | 451,009 | 2015-01-07 |
| Iran Talks | #IranTalks | 117,966 | 2015-04-02 |
For full list check: http://hadoop.dlib.vt.edu:81/twitter/
![Page 8: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/8.jpg)
Building archives for events - 2 Seeds from social media
[Figure: pipeline from event to access. Collect: Event → Keyword/Hashtag → Collect Tweets → Tweet Collection → Extract URLs → Shortened URLs → Expand → Original Webpages. Archive/Organize/Analyze: Archive (WARC), Index (SOLR). Access: Browse (Wayback), Search.]
![Page 9: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/9.jpg)
Building archives for events - 3 Focused Crawling
• Curator selects high-quality seed URLs
• Use Event Focused Crawler (EFC) to retrieve webpages that are highly similar to the webpages at the seed URLs
• Curator can configure EFC to adjust the number of webpages retrieved and the quality of retrieved webpages (similarity threshold)
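The two curator settings mentioned above (a cap on the number of pages and a similarity threshold) can be illustrated with a small sketch. The slides do not say how EFC scores similarity, so cosine similarity over term-frequency vectors is used here purely as a stand-in; the function names and sample pages are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def term_vector(text: str) -> Counter:
    """Bag-of-words term frequencies for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_pages(seed_texts: List[str], candidates: Dict[str, str],
                 threshold: float, max_pages: int) -> List[str]:
    """Keep candidate URLs whose text is similar enough to the pooled seed text."""
    seed_vec = term_vector(" ".join(seed_texts))
    scored = [(cosine(seed_vec, term_vector(text)), url)
              for url, text in candidates.items()]
    kept = sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
    return [url for _, url in kept[:max_pages]]


# Hypothetical seed and candidate page texts.
seeds = ["Hurricane Sandy makes landfall in New Jersey",
         "Sandy storm surge floods New York"]
candidates = {
    "http://example.org/a": "Hurricane Sandy storm surge floods New Jersey coast",
    "http://example.org/b": "Thanksgiving recipes for a family dinner",
}
selected = select_pages(seeds, candidates, threshold=0.3, max_pages=10)
print(selected)
```

Raising `threshold` trades recall for precision, while `max_pages` bounds the size of the resulting collection.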
![Page 10: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/10.jpg)
Building archives for events - 3 Focused Crawling
![Page 12: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/12.jpg)
Event Model and Representation
• Modeling events
  – What happened, where, and when
• Information retrieval
  – Helps find the What part (Vector Space/Probabilistic)
• Natural language processing
  – Helps find the Where and When parts (Named Entity Recognition)
![Page 13: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/13.jpg)
Event Model and Representation
• Educational activities
  – CS4984 Computational Linguistics (Fall 2014)
  – CS5604 Information Retrieval (Spring 2015)
• Equipment
  – Hadoop cluster with 20 data nodes
  – 612 RAM, 76 Cores, and 60 TB Disk
• Processing methods
  – Stanford Named Entity Recognition
  – Mahout routine for topic identification
  – Python programming for text analysis (Hadoop streaming)
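As a rough idea of what Python text analysis in the Hadoop streaming style looks like, here is a word-count sketch. It is an assumption about the general shape of such jobs, not the course or project code: the mapper emits tab-separated key/value records, the framework sorts them by key, and the reducer sums consecutive records. The map, sort, and reduce phases are simulated locally on hypothetical input.

```python
import re
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record per token, as a streaming mapper would."""
    for line in lines:
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            yield f"{word}\t1"


def reducer(sorted_records: Iterable[str]) -> Iterator[str]:
    """Sum the counts of consecutive records that share a key."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> shuffle/sort -> reduce on hypothetical documents.
documents = ["Ebola outbreak in West Africa", "West Africa Ebola update"]
word_counts = dict(line.split("\t") for line in reducer(sorted(mapper(documents))))
print(word_counts)
```

In an actual streaming job, the mapper and reducer would each be a separate script that reads records from `sys.stdin` and prints them to standard output.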
![Page 15: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/15.jpg)
Assessing archive quality using event model
• Approaches to textual and linguistic analysis of an archive
  – Frequent and important words in whole collection
  – Important sentences: sentences that have one or more frequent words
  – Frequent entities (locations and dates) extracted from important sentences
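The first two approaches listed above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the stopword list is deliberately tiny, sentences are split with a naive regular expression instead of a proper tokenizer, and the sample documents are invented. Entity extraction is left out because it needs a trained named entity recognizer.

```python
import re
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stopword list (an assumption; a real one would be much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "was", "on", "for", "at"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequent_words(documents: List[str], top_n: int) -> List[str]:
    """Most frequent non-stopword tokens across the whole collection."""
    counts = Counter(w for doc in documents for w in tokenize(doc)
                     if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def important_sentences(documents: List[str], keywords: List[str],
                        min_hits: int = 1) -> List[Tuple[int, str]]:
    """Sentences containing at least min_hits distinct keywords, best first."""
    keyset = set(keywords)
    scored = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            hits = len(keyset & set(tokenize(sentence)))
            if hits >= min_hits:
                scored.append((hits, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical mini-collection.
docs = [
    "A strong earthquake struck Chile on Saturday. Officials issued a tsunami warning.",
    "The Chile earthquake damaged roads. Rescue teams reached the coast.",
    "Earthquake aftershocks continued in Chile.",
]
top_words = frequent_words(docs, top_n=2)
selected_sentences = important_sentences(docs, top_words, min_hits=2)
print(top_words)
for hits, sentence in selected_sentences:
    print(hits, sentence)
```

The selected sentences would then be passed to a named entity recognizer to obtain locations and dates.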
![Page 16: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/16.jpg)
Assessing archive quality using event model
[Figure: event model extraction pipeline. Webpages → Text Extraction → Text Content → Frequency Analysis → Frequent Words; Text Content → Sentence Tokenization → Sentences → Keyword Matching → Selected Sentences → Named Entity Recognition → Event Entities → Aggregation → Event Model]
Event Model
Topic: (t1, t2, ..., tn)
Location: (l1, l2, ..., ln)
Date: (d1, d2, ..., dn)
![Page 17: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/17.jpg)
Example
• Ebola Outbreak (22 documents)
• Top 10 frequent words and top 2 sentences that include 2 or more frequent words

Frequent Words: Ebola, Virus, Disease, Health, 2014, Africa, West, Ago, University, Outbreak

| Important Sentences | Extracted Entities |
| --- | --- |
| Outbreak of Ebola virus disease in West Africa: third update, 1 August 2014. (7) | DATE: ['August 2014'], LOCATION: ['West Africa'] |
| ECDC (2014) Outbreak of Ebola virus disease in West Africa. (7) | LOCATION: [u'West Africa'] |
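To show how the entities from the example might be aggregated into the Topic/Location/Date event model, here is a small sketch. The frequent words and per-sentence entities are copied from the example; the `aggregate` helper and the dictionary layout of the event model are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Values copied from the Ebola example: frequent words and per-sentence entities.
frequent = ["Ebola", "Virus", "Disease", "Health", "2014",
            "Africa", "West", "Ago", "University", "Outbreak"]
sentence_entities = [
    {"DATE": ["August 2014"], "LOCATION": ["West Africa"]},
    {"LOCATION": ["West Africa"]},
]


def aggregate(entities: List[Dict[str, List[str]]], label: str) -> List[Tuple[str, int]]:
    """Count how often each entity of the given type occurs across sentences."""
    counts = Counter(value for item in entities for value in item.get(label, []))
    return counts.most_common()


# Hypothetical representation of the event model as a dictionary.
event_model = {
    "Topic": frequent,
    "Location": aggregate(sentence_entities, "LOCATION"),
    "Date": aggregate(sentence_entities, "DATE"),
}
print(event_model["Location"])  # [('West Africa', 2)]
print(event_model["Date"])      # [('August 2014', 1)]
```

Ranking entities by count gives a simple way to pick the dominant location and date for a collection.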
![Page 19: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/19.jpg)
Archive Quality Assessment
• http://nick.dlib.vt.edu/EventModel/
• Input:
  – Existing collections, WARC file, Text file with list of URLs
• Frontend: HTML, JavaScript/Dojo
• Backend: Python, NLTK
![Page 20: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/20.jpg)
Sample Results
![Page 21: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/21.jpg)
![Page 22: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/22.jpg)
![Page 23: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/23.jpg)
Future Work
• Use event model to:
  – Summarize event collection (generate most informative sentence)
  – Extract relevant parts from webpage
![Page 24: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/24.jpg)
Thank You
Questions?
Mohamed Farag
Dr. Edward A. Fox
![Page 25: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/25.jpg)
IDEAL Interface
• http://nick.dlib.vt.edu/ideal/collections/index.php
• Collections
  – 11 event categories, 2 events each (small and big size)
  – Total 1.6 M documents
• Services:
  – Search (keywords, web collections text)
  – Browse (event categories and events metadata, web and tweet collections)
![Page 26: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/26.jpg)
Technologies
• Search engine
  – Solr 4.9 (http://lucene.apache.org/solr/)
• Web Interface
  – Apache server
  – JavaScript - Solr library (https://github.com/evolvingweb/ajax-solr/wiki)
• Tweets archiving
  – yourTwapperKeeper (https://github.com/540co/yourtwapperkeeper)
• Webpages archiving
  – Archive-it service from Internet Archive (https://archive-it.org/)
![Page 27: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/27.jpg)
Collections

| Category/Collection | Big | Small |
| --- | --- | --- |
| Accident | Train derailment in Quebec | Texas factory explosion |
| Bombing | Boston bombing | Somalia Blast |
| Community | Blacksburg events | Labor day and world cup 2014 |
| Disease Outbreak | Ebola | encephalitis |
| Earthquake | Turkey earthquake | Virginia earthquake and others |
| Fire | Brazil night club fire | Texas wild fire |
| Flood | Pakistan flood | China flood and Islip 13 inch rain |
| Hurricane | Hurricane Sandy | Typhoon Haiyan |
| Plane Crash | Russia Plane Crash | Nevada air race crash |
| Shooting | April 16 shooting | Norway shooting and others |
![Page 28: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/28.jpg)
Search Interface
![Page 29: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/29.jpg)
Searching Sandy
![Page 30: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/30.jpg)
Faceted Search
Search all events under Fire
![Page 31: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/31.jpg)
Faceted Search
Search Brazil Night Club Fire
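Faceted searches like the two shown on these slides could be expressed as Solr select requests with filter queries. The sketch below only builds the request URLs; the endpoint, the core name, and the field names `event_type` and `event` are hypothetical, since the slides do not show the IDEAL schema. The parameters `q`, `fq`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and field names; the real IDEAL schema is not shown in the slides.
SOLR_SELECT = "http://localhost:8983/solr/ideal/select"
CATEGORY_FIELD = "event_type"
EVENT_FIELD = "event"


def build_query(keywords: str, category: Optional[str] = None,
                event: Optional[str] = None) -> str:
    """Build a Solr select URL with facet counts and optional facet filters."""
    params = [
        ("q", keywords),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", CATEGORY_FIELD),
        ("facet.field", EVENT_FIELD),
    ]
    if category:
        params.append(("fq", f'{CATEGORY_FIELD}:"{category}"'))
    if event:
        params.append(("fq", f'{EVENT_FIELD}:"{event}"'))
    return SOLR_SELECT + "?" + urlencode(params)


# All events under the Fire category, then one specific event within it.
all_fire_events = build_query("*:*", category="Fire")
one_event = build_query("*:*", category="Fire", event="Brazil night club fire")
print(all_fire_events)
print(parse_qs(urlsplit(one_event).query)["fq"])
```

Using `fq` for the facet selections keeps the user's keyword query in `q` unchanged, so narrowing by category or event does not alter relevance scoring.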
![Page 32: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/32.jpg)
Browse Interface
![Page 33: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/33.jpg)
Select Event Type
![Page 34: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/34.jpg)
Select Event
![Page 35: Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d025503460f949d5db1/html5/thumbnails/35.jpg)
Hurricane Events