Recent approaches to capture web content, which Heritrix can’t harvest
Capturing Social Media
Screen filming of Rich Media
Project: Event crawl of The Eurovision Song Contest in Copenhagen 2014
Cooperation with researchers
NAS workshop, Paris 2014/Sabine Schostag
Why focus on social media?
Nowadays social media are the primary communication platforms during cultural and political events
Politicians, artists, musicians and even traditional news media such as TV use social media more than traditional web pages
Entries on social media pages are ephemeral, so we need to capture them at a very high frequency
Which social media did we crawl?
Twitter.com: comments
YouTube.com: video and comments
Facebook.com: comments
Live blogs
Excluded for technical reasons …
instagram.com: video and images
tumblr.com: multimedia blog
flickr.com: images
vimeo.com: video
Which tools did we use?
Harvesting with NetarchiveSuite using Heritrix 1.4*, weekly, daily and hourly
“Crontab”-based screen dumping of static URLs using PhantomJS to searchable PDFs
Manual browsing with LAP (Live Archive Program)
XML extracts from APIs using our own tools and/or Digitalfootprints.dk
Harvesting YouTube videos by extracting the video URLs from the “watch-url” pages with our own tool
Screen recording using CamStudio.org and a Netlab.dk Linux tool wrapping “ffmpeg”
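To illustrate the crontab-based screen dumping, here is a minimal sketch that only builds the crontab line for one static URL. It is not the script used in the project. It assumes the `rasterize.js` example script distributed with PhantomJS; the paths, the `name` label and the three schedules are hypothetical choices for illustration.

```python
import shlex

# Five-field cron schedules; the exact minutes/hours are illustrative assumptions.
SCHEDULES = {
    "hourly": "0 * * * *",   # at minute 0 of every hour
    "daily": "0 6 * * *",    # every day at 06:00
    "weekly": "0 6 * * 1",   # every Monday at 06:00
}


def crontab_line(url, name, frequency,
                 script="/opt/dump/rasterize.js", out_dir="/data/dumps"):
    """Return one crontab line that renders `url` to a timestamped PDF.

    `script` and `out_dir` are hypothetical paths, not taken from the slides.
    """
    if "%" in url:
        # '%' has a special meaning in crontab commands; a wrapper script
        # would be the safer place for such URLs.
        raise ValueError("URL contains '%'; call PhantomJS from a wrapper script")
    schedule = SCHEDULES[frequency]
    # Inside a crontab entry every '%' of the date format is escaped as '\%'.
    stamp = r"$(date +\%Y\%m\%dT\%H\%M)"
    target = f"{out_dir}/{name}-{stamp}.pdf"
    return f"{schedule} phantomjs {shlex.quote(script)} {shlex.quote(url)} {target}"


if __name__ == "__main__":
    print(crontab_line("https://twitter.com/hashtag/esc2014", "twitter-esc", "hourly"))
```

Each dump gets its own timestamped file name, so repeated runs do not overwrite earlier captures of the same page.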
…more about the automated screen filming tool
developed as part of a research project by a curator/researcher, now implemented as a tool
allows scheduled capturing
is well suited to capture pre-planned streamed content
is well suited to capture frequently updated content which refreshes automatically (no mouse clicks)
is not a replacement for existing collection methods, but a supplement
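The slides do not describe how the Netlab.dk tool wraps ffmpeg. As a rough sketch of what a fixed-length screen recording on Linux could look like, the function below builds an ffmpeg command line using the `x11grab` input device. The display name, frame size, frame rate and output file name are assumptions chosen for illustration.

```python
def ffmpeg_screen_command(output, seconds, display=":0.0",
                          size="1280x720", framerate=25):
    """Build (but do not run) an ffmpeg command that films an X11 display.

    All default values are illustrative assumptions.
    """
    return [
        "ffmpeg",
        "-f", "x11grab",              # capture from an X11 display
        "-video_size", size,          # size of the captured area
        "-framerate", str(framerate), # frames per second to grab
        "-i", display,                # which display/screen to film
        "-t", str(seconds),           # stop after this many seconds
        output,
    ]


if __name__ == "__main__":
    # One hour of recording; scheduling the start time is left to cron or similar.
    print(" ".join(ffmpeg_screen_command("esc-final.mkv", 3600)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Starting it from a scheduler at a known broadcast time is one way to cover pre-planned streamed content.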
…more about the automated screen filming tool
The tool enables the user to programme every mouseclick, every interaction on the webpage
…some screenshots from the filming tool
ESC 2014 and the European Parliament Elections 2014
Lessons learned
NetarchiveSuite using Heritrix 1.4* can’t harvest JavaScript with AJAX or keep up with the high frequency of feeds, e.g. 47,000 tweets/minute.
You can record the “look and feel” with screen recording and dumping, but producing the files and the provenance documentation outside the archive is a HUGE manual task.
The LAP tool is not very useful, as it doesn’t support https (most social media use https today).
“Digitalfootprints.dk” can archive almost all XML content for Twitter, which could afterwards be harvested by NetarchiveSuite/Heritrix.
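Part of the manual provenance work for screen recordings and dumps could in principle be scripted. The sketch below writes a small JSON sidecar file next to a captured file. The field names and the `.provenance.json` suffix are hypothetical and do not represent a Netarchive or WARC format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(path, source_url, tool, operator):
    """Describe one captured file; all field names are illustrative."""
    data = Path(path).read_bytes()
    return {
        "file": Path(path).name,
        "sha256": hashlib.sha256(data).hexdigest(),  # fixity checksum
        "size_bytes": len(data),
        "source_url": source_url,
        "tool": tool,
        "operator": operator,
        # Time the record was written (UTC), not the capture start time.
        "recorded_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


def write_sidecar(path, record):
    """Store the record as '<file>.provenance.json' next to the capture."""
    sidecar = Path(str(path) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

A sidecar like this keeps the source URL and a checksum together with the file until both can be ingested into the archive.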
Current issues
wider access
better access (free text search)
inclusion of older net collections
collection of websites with restricted access
advanced web content, i.e. with sound/video/live interaction (chat, virtual worlds …)
electronic communication networks ≠ the web
long-term preservation documentation
… and from the technical point of view
more stable and operational screen recording and dumping tools for huge social media events
build social media API extract plugins into Heritrix, and better support for WARC linking of e.g. YouTube watch and video download URLs
build scripting and https support into the LAP tool
upgrade NetarchiveSuite to Heritrix 3.* to better support JavaScript with AJAX (using the Umbra plugin) and continuous crawling
Epilogue
For the first time in Netarchive’s history the whole team met for two days