A Comparison Between the Performance of Wayback Machines Fernando Melo [email protected]


Page 1: A Comparison Between the Performance of Wayback Machinessobre.arquivo.pt/.../a-comparison-between-the-performance-of-wayba… · Wayback Machine Software! Wayback Machines Arquivo.pt

A Comparison Between the Performance of Wayback Machines

Fernando Melo [email protected]

Main reasons for this study

Outdated Wayback

Evaluate possible alternatives

How does a Web archive work?

[Diagram: the live page (2016) is crawled into (W)ARC files, which are indexed (Lucene, CDX); the Wayback uses the index for search and replay.]

What is a Wayback Machine?

Software Component

Replay Archived Web Pages

Search by URL and Date
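
Searching by URL and date can be sketched as a lookup of the capture closest to a requested moment. This is only an illustration: the in-memory `index` dictionary, the function name `closest_capture`, and the example URL are hypothetical, and the 14-digit `YYYYMMDDhhmmss` timestamp form is assumed here rather than taken from the slides.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit capture timestamp


def closest_capture(index, url, requested):
    """Return the capture timestamp of `url` nearest to `requested`, or None."""
    captures = index.get(url)
    if not captures:
        return None
    target = datetime.strptime(requested, TS_FORMAT)
    # Pick the capture with the smallest absolute time distance to the target.
    return min(
        captures,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - target),
    )


# Hypothetical index: URL -> list of capture timestamps.
index = {
    "http://example.eu/": ["20100301120000", "20140714093000", "20160101000000"],
}

print(closest_capture(index, "http://example.eu/", "20140101000000"))
```

A real Wayback would read such captures from a CDX or Lucene index instead of a dictionary; the selection step is the part shown here.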

Common Wayback Machine Issues

Slow Replay

Not Found Errors

Live-Web Leaks

[Diagram: a link in a 2010 archived page should lead to another 2010 archived page; in a live-Web leak it leads to the 2016 live page instead.]

[Screenshots: original Web page (July 14th, 2012) compared with the archived Web page (July 14th, 2012).]
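
One common way a replay tool tries to avoid such leaks is to rewrite the links of an archived page so that they point back into the archive. The sketch below is a simplification written for this transcript, not the rewriting logic of any of the Wayback Machines compared here: `ARCHIVE_PREFIX`, `rewrite_links`, and the example URLs are hypothetical, and only double-quoted `href` and `src` attributes are handled.

```python
import re
from urllib.parse import urljoin

# Hypothetical replay prefix; a real deployment would use its own host and path.
ARCHIVE_PREFIX = "http://archive.example/wayback"

# Matches double-quoted href/src attributes only (a deliberate simplification).
LINK_RE = re.compile(r'(?P<attr>\b(?:href|src))="(?P<url>[^"]*)"', re.IGNORECASE)


def rewrite_links(html, page_url, timestamp):
    """Point every href/src at the archived copy instead of the live Web."""

    def repl(match):
        # Resolve relative links against the URL of the archived page.
        absolute = urljoin(page_url, match.group("url"))
        return f'{match.group("attr")}="{ARCHIVE_PREFIX}/{timestamp}/{absolute}"'

    return LINK_RE.sub(repl, html)


html = '<a href="/news.html">News</a> <img src="http://other.eu/logo.png">'
print(rewrite_links(html, "http://example.eu/index.html", "20100301120000"))
```

Links built by JavaScript or embedded in CSS are not covered by a regular expression like this one, which is one reason leaks can still occur in practice.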

Let’s evaluate the performance of Wayback Machine Software!

Wayback Machines

Arquivo.pt Wayback

OpenWayback

PyWb

Arquivo.pt Wayback

Derives from version 1.2.1 of Open Source Wayback Machine (2008)

Java

Used by Arquivo.pt

Outdated - Presents several replay issues

PyWb Wayback

Developed by Ilya Kreymer

Python

Used by

http://rhizome.org

http://webrecorder.io

http://perma.cc

OpenWayback

Released by the Internet Archive

Maintained by the IIPC

Java

OpenWayback - Users

National and University Library of Iceland

The British Library

Archive-It Mirror @ ODU

Stanford Web Archive Portal

The Library of Congress

Bibliotheca Alexandrina

York University Digital Library

Bibliothèque nationale de France

University of North Texas Libraries

The .EU Collection - 2014

Domains can be sold to anyone with a valid address in the European Union

European Institutions, Online Shops, and Web Spam

250 million documents from 34 thousand seeds

6TB

Methodology

400 URLs from the .EU

WebPageTest service

4 Wayback Configurations

HAR – to record performance data

Only test each URL once

Tested using WebPageTest public servers

Response timeout of 2 minutes

Error Code – Leak to the live Web
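
As a rough illustration of how recorded HAR data could be turned into counts like those in the results, the sketch below tallies HTTP status codes and flags requests that left the archive. It relies on the standard HAR layout (`log.entries[].request.url` and `log.entries[].response.status`). The host-based leak check, the function name `summarize_har`, and the sample data are assumptions made for this example; the slides do not describe how the study detected leaks beyond an error code.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize_har(har, archive_host):
    """Count status codes for archive requests and count live-Web leaks.

    A request is treated as a leak when its host differs from `archive_host`
    (an assumption of this sketch, not the study's documented method).
    """
    statuses = Counter()
    leaks = 0
    for entry in har["log"]["entries"]:
        host = urlparse(entry["request"]["url"]).netloc
        if host != archive_host:
            leaks += 1
            continue
        statuses[entry["response"]["status"]] += 1
    return statuses, leaks


# Hypothetical HAR fragment; a real file would be loaded with json.load().
har = {"log": {"entries": [
    {"request": {"url": "http://archive.example/wayback/2014/http://a.eu/"},
     "response": {"status": 200}},
    {"request": {"url": "http://archive.example/wayback/2014/http://a.eu/x.css"},
     "response": {"status": 200}},
    {"request": {"url": "http://archive.example/wayback/2014/http://a.eu/y.js"},
     "response": {"status": 404}},
    {"request": {"url": "http://a.eu/tracker.gif"},
     "response": {"status": 200}},
]}}

statuses, leaks = summarize_har(har, "archive.example")
print(dict(statuses), leaks)
```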

Wayback Specifications

Wayback configuration    Year
Arquivo Pwa Lucene       2008
PyWb CDX                 2015
PyWb Pwa Lucene          2015
OpenWayback CDX          2015

Replay Quality – HTTP Status and Error Codes

Results – Live Web Leaks

[Bar chart: number of URLs (axis from 0 to 18 000) for each configuration: Arquivo Pwa Lucene, PyWb CDX, PyWb Pwa Lucene, OpenWayback CDX.]

Results – Timeout Error

[Bar chart: number of URLs (axis from 0 to 1 200) for each configuration: Arquivo Pwa Lucene, PyWb CDX, PyWb Pwa Lucene, OpenWayback CDX.]

Results – 200 OK Status Code

[Bar chart: number of URLs (axis from 0 to 25 000) for each configuration: Arquivo Pwa Lucene, PyWb CDX, PyWb Pwa Lucene, OpenWayback CDX.]

Results – 404 Error HTTP Code

[Bar chart: number of URLs (axis from 0 to 7 000) for each configuration: Arquivo Pwa Lucene, PyWb CDX, PyWb Pwa Lucene, OpenWayback CDX.]

Results – Summary Table

Wayback            Success    Error     Success/Error
Arquivo              3 930    17 711    0.22
PyWb CDX            19 415     7 082    2.74
PyWb Pwa Lucene     11 087     4 652    2.38
OpenWayback         13 068     4 668    2.80
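
The Success/Error column can be reproduced by dividing the two counts and rounding to two decimals. The short check below uses the counts exactly as given in the table; the dictionary layout and the function name `success_error_ratio` are illustrative only, and the slides do not spell out which responses count as success or error.

```python
# Counts copied from the summary table: (success, error).
results = {
    "Arquivo": (3930, 17711),
    "PyWb CDX": (19415, 7082),
    "PyWb Pwa Lucene": (11087, 4652),
    "OpenWayback": (13068, 4668),
}


def success_error_ratio(success, error):
    """Ratio of successful to failed responses, rounded to two decimals."""
    return round(success / error, 2)


for name, (success, error) in results.items():
    print(f"{name}: {success_error_ratio(success, error):.2f}")
```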

Response Speed

Results – Average Load Time

[Bar chart: average load time in seconds (axis from 0 to 40) for each configuration: Arquivo Pwa Lucene, PyWb CDX, PyWb Pwa Lucene, OpenWayback CDX.]

Conclusions

PyWb returned the largest number of 200 OK HTTP status codes

OpenWayback was the fastest Wayback

Replace or Update Arquivo.pt’s Wayback!

Future Work

Test with older collections to evaluate the performance of Wayback Machine software

Test with a private instance of a WebPageTest server, in order to run more tests and to control the server workload

References

https://github.com/Fernando-Melo/WaybackComparison