WARCreate - Create Wayback-Consumable WARC Files from Any Webpage

July 25, 2012 Arlington, Virginia Digital Preservation 2012 warcreate.com WARCreate Create Wayback-Consumable WARC Files from Any Webpage Mat Kelly, Michele C. Weigle, Michael L. Nelson {mkelly,mweigle,mln}@cs.odu.edu Old Dominion University; Norfolk, VA

Upload: mat-kelly

Post on 08-May-2015


DESCRIPTION

The Internet Archive's Wayback Machine is the most common way that typical users interact with web archives. The Internet Archive uses the Heritrix web crawler to transform pages on the publicly available web into Web ARChive (WARC) files, which can then be accessed using the Wayback Machine. Because Heritrix can only access the publicly available web, many personal pages (e.g., password-protected pages, social media pages) cannot be easily archived into the standard WARC format. We have created a Google Chrome extension, WARCreate, that allows a user to create a WARC file from any webpage. Using this tool, content that might otherwise have been lost in time can be archived in a standard format by any user. This tool provides a way for casual users to easily create archives of personal online content. This is one of the first steps in resolving issues of long-term storage, maintenance, and access of personal digital assets that have emotional, intellectual, and historical value to individuals.

TRANSCRIPT

Page 1: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage

July 25, 2012, Arlington, Virginia
Digital Preservation 2012
warcreate.com

WARCreate
Create Wayback-Consumable WARC Files from Any Webpage

Mat Kelly, Michele C. Weigle, Michael L. Nelson
{mkelly,mweigle,mln}@cs.odu.edu

Old Dominion University; Norfolk, VA

Page 2: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


What is WARCreate?

• Google Chrome extension
• Creates WARC files
• Enables preservation by users from their browser
• First steps in bringing institutional archiving facilities to the PC

Page 3: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Target Content

• Unreachable by web crawlers
  – Behind authentication
  – Not listed in search engines (Deep Web)
• Private
  – We don’t want our bank statements in Wayback
• Non-pertinent to public
  – Others have little interest in our Facebook comments

Page 4: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Preserving More!

• Much digital information is needlessly lost

• User chooses what they deem important

• Compatible with standard archiving tools.

Page 5: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


WYSIWYG

Facebook-Supplied Data Dump
Archive created from WARCreate in Wayback

Page 6: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


WYSIWYG

Using Scraping Tools (e.g., wget)
Archive created from WARCreate in Wayback

Page 7: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


WYSIWYG

A Crawler Has No Context
Archive created from WARCreate in Wayback

Page 8: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


WYSIWYG

IA/Heritrix Obey Robots
Archive created from WARCreate in Wayback
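
What "obeying robots" means for a crawler can be sketched with Python's standard `urllib.robotparser`. This is only an illustration: the robots.txt content, the user-agent name, and the URLs below are invented for the example and do not come from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots-obeying crawler would be allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/page.html",
    ):
        # "example-crawler" is a made-up user-agent name.
        print(url, crawler_may_fetch(ROBOTS_TXT, "example-crawler", url))
```

A crawler that honors these rules skips the disallowed page, whereas a browser extension works from the page the user has already loaded.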

Page 9: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Goals

• Make it easy to use (GUI-based, no command line)
• Make it useful (fill the need)
• Demonstrate novelty of browser-instigated preservation
• Show value of WARC format for Personal Web preservation
• Bring WARC format to Personal Digital Archiving

Page 10: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Creating a WARC

Page 11: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


I’ve Made a WARC. Now what?

• What you do with the archive is up to you.
  – Install it in your local Wayback instance
• Who has their own Wayback instance!?
  – Wayback is free & open source
• That seems like a lot of work!
  – One additional reason for users NOT to preserve what they would like archived

Page 12: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


WARC Creation & Replay

1. User visits a website using their browser
2. WARCreate captures the HTTP headers
3. User selects the “Generate WARC” button in WARCreate
4. WARC generated, saved locally to a directory accessible to the local Wayback
5. Local Wayback instance indexes WARC
6. User accesses local Wayback to view preserved content
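
To make the "captures the HTTP headers" and "WARC generated" steps concrete, here is a minimal Python sketch of how one WARC/1.0 record is laid out: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. It is not WARCreate's own code (the extension runs in the browser); the function name `build_warc_record` and the example URL are assumptions made for illustration, and only a minimal set of header fields is written.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(warc_type, block, extra_headers=None, when=None):
    """Serialize one WARC/1.0 record (header fields + content block) as bytes."""
    when = when or datetime.now(timezone.utc)
    headers = {
        "WARC-Type": warc_type,
        "WARC-Date": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        "Content-Length": str(len(block)),
    }
    headers.update(extra_headers or {})
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers.items())
    # A blank line separates header from block; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"


if __name__ == "__main__":
    # The block of a "response" record is the HTTP response as captured:
    # status line, HTTP headers, blank line, then the body.
    http_response = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<html><body>Hello</body></html>"
    )
    record = build_warc_record(
        "response",
        http_response,
        {
            "WARC-Target-URI": "http://example.com/",  # made-up URL
            "Content-Type": "application/http; msgtype=response",
        },
    )
    print(record.decode("utf-8"))
```

Writing the returned bytes to a `.warc` file in a directory that the local Wayback reads corresponds to step 4 above.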

Page 13: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Suite Installation & Interaction

• Drag & drop .zip to hard drive

• Start relevant services using GUI

• Execute WARCreate process

• View Archive at http://localhost/wayback

Page 14: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Replay of Preserved Twitter page

Page 15: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


And My Bank Statements?

• Preserved content:
  – never leaves WARC files
  – never leaves local machine

• WARCreate provides preliminary encoding/encryption support

• Wayback instance is hosted on your own machine – no external access by default

Page 16: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Why Use a Client-Side Server?

• Server scripts do what JS can’t
• Can reside on your machine!
• Controls are GUI based
• Resource fetching w/o XSS issues

XAMPP-Based Personal Web Archiving Suite
• Local Wayback Instance
• WARCreate Server-Side Support
• Memento Proxy
• …

Built On: Tomcat, Apache

Page 17: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Extras: Memento Support

• Suite includes a tailored Timegate

• Memento abstraction is beyond WARC

• Point MementoFox (or other Memento tools) to localhost
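
Pointing a Memento tool at localhost amounts to sending the Timegate a request that carries the desired date in an `Accept-Datetime` header. The following Python sketch only builds that request; the Timegate path `http://localhost/timegate/` and the function name are hypothetical, since the talk says no more than "localhost".

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical local Timegate URL; the real path depends on the suite's setup.
TIMEGATE = "http://localhost/timegate/"


def timegate_request(target_url: str, desired: datetime):
    """Return (url, headers) for a Memento datetime-negotiation request.

    `desired` must be a timezone-aware datetime; it is converted to GMT
    because Accept-Datetime uses the HTTP date format.
    """
    gmt = desired.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(gmt, usegmt=True)}
    return TIMEGATE + target_url, headers


if __name__ == "__main__":
    url, headers = timegate_request(
        "http://example.com/",  # made-up original URL
        datetime(2012, 7, 15, 22, 15, 59, tzinfo=timezone.utc),
    )
    print(url)
    print(headers)
```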

Page 18: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


How it All Relates

Browser Extensions (in the browser): WARCreate, MementoFox

WARCreate generates WARC file:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2012-07-15T22:15:59.485Z
WARC-Filename: 2220471175c820fee3fec986040ebd1f.warc

Local Wayback Instance: Index WARCs

Local Timegate: Send Desired Date; Memento negotiated & returned

Personal Archives Accessible at localhost
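
The "Memento negotiated & returned" step can be pictured as choosing, from the indexed captures of a page, the one closest to the desired date. The sketch below uses the simplest nearest-in-time rule, which is an assumption; the suite's Timegate may select differently, and the capture dates are invented.

```python
from datetime import datetime, timezone


def closest_memento(captures, desired):
    """Return the capture datetime nearest to the desired datetime."""
    if not captures:
        raise ValueError("no captures indexed for this page")
    return min(captures, key=lambda captured: abs(captured - desired))


if __name__ == "__main__":
    # Invented capture times for one page in a local archive.
    captures = [
        datetime(2012, 5, 1, tzinfo=timezone.utc),
        datetime(2012, 7, 15, tzinfo=timezone.utc),
        datetime(2012, 9, 30, tzinfo=timezone.utc),
    ]
    print(closest_memento(captures, datetime(2012, 7, 1, tzinfo=timezone.utc)))
```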

Page 19: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Contribution of Work

• Facilitate browser-based Personal Web Archiving

• Determine feasibility of fully Client-Side Preservation

• Integrate with existing tools for establishing use cases

Page 20: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


WARCreate
Create Wayback-Consumable WARC Files from Any Webpage

http://WARCreate.com

Page 21: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Backup Slides

Page 22: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Future Work

• Decouple from “server”
• Refine Memento integration
• Reference full WARC spec
• Built-in WARC validation
• Built-in replay
• Compression
• Optimization (removing duplicates)
• …many more
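
As an illustration of what a built-in validation step might check, the Python sketch below tests a single uncompressed record for the header fields that WARC/1.0 makes mandatory and for a matching `Content-Length`. It is a hypothetical minimal check, not the authors' planned design, and far short of full validation against the specification.

```python
MANDATORY = ("WARC-Type", "WARC-Date", "WARC-Record-ID", "Content-Length")


def validate_record(record: bytes) -> list:
    """Return a list of problems found in one WARC record (empty if none)."""
    problems = []
    head, sep, rest = record.partition(b"\r\n\r\n")
    if not sep:
        return ["no blank line after the WARC header"]
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    if lines[0] != "WARC/1.0":
        problems.append(f"unexpected version line: {lines[0]!r}")
    fields = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if not colon:
            problems.append(f"malformed header line: {line!r}")
            continue
        fields[name.strip().lower()] = value.strip()
    for name in MANDATORY:
        if name.lower() not in fields:
            problems.append(f"missing mandatory field: {name}")
    if not rest.endswith(b"\r\n\r\n"):
        problems.append("record does not end with two CRLFs")
    elif fields.get("content-length", "").isdigit():
        # The block is everything between the blank line and the final CRLFs.
        if int(fields["content-length"]) != len(rest) - 4:
            problems.append("Content-Length does not match block size")
    return problems


if __name__ == "__main__":
    # Hand-written sample record with an invented record ID.
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: warcinfo\r\n"
        b"WARC-Date: 2012-07-15T22:15:59Z\r\n"
        b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
        b"Content-Length: 5\r\n"
        b"\r\n"
        b"hello\r\n\r\n"
    )
    print(validate_record(sample))
```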

Page 23: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Extras: Configuration Sanity Check

• Server scripts make up for JavaScript shortcomings

• The server can reside on your machine!

• Setup, Start, Stop are GUI based

✗ WARC Validation
✗ AJAX XSS Circumvention
✗ HTML5 Sandbox Escaping
✗ Memento Support
✗ Local Wayback Instance

In WARCreate

Page 24: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Extras: Configuration Sanity Check

+ Apache allows generated WARCs to be validated

+ JavaScript cannot write to disk; server-side scripts can

+ Server prevents hot-linking & has security

= Content better preserved using server techs

✓ WARC Validation
✓ AJAX XSS Circumvention
✓ HTML5 Sandbox Escaping
? Memento Support
✗ Local Wayback Instance

In WARCreate

Page 25: WARCreate - Create Wayback-Consumable WARC Files from Any Webpage


Extras: Configuration Sanity Check

• Memento requires Wayback
  Wayback requires Tomcat
  ∴ Memento requires Tomcat
• Memento Timegate requires Python + modules (pre-packaged + included)

✓ WARC Validation
✓ AJAX XSS Circumvention
✓ HTML5 Sandbox Escaping
✓ Memento Support
✓ Local Wayback Instance

In WARCreate