Link Persistence, Website Persistence. Nicholas Taylor (@nullhandle), May 28, 2013. “Forward” by Flickr user Hitchster under CC BY 2.0

DESCRIPTION

Presentation on the discrepancy between measurements of link persistence and website persistence and why it matters.

TRANSCRIPT

Page 1: Link Persistence, Website Persistence

Link Persistence, Website Persistence

Nicholas Taylor

@nullhandle

May 28, 2013

“Forward” by Flickr user Hitchster under CC BY 2.0

Page 2: Link Persistence, Website Persistence

why preserve the web?

Page 7: Link Persistence, Website Persistence

variable (Sanderson, Phillips, and Van de Sompel, 2011)

• literature review of 17 studies
• research focused on scholarly citations
• decay rates of 39-82%
• over periods of 1-13 years

Page 8: Link Persistence, Website Persistence

“Digital documents last forever—or five years, whichever comes first.”

(Jeff Rothenberg, 1997)

“Out of books sprout... plants” by DeviantArt user quinn.anya under CC BY-SA 2.0

Page 9: Link Persistence, Website Persistence

The Art and Science of

LINK CHECKING

“http Blue Background” by DeviantArt user SoulArt2012 under CC BY-NC-ND 3.0

Page 10: Link Persistence, Website Persistence

http response codes

• 404: “Not Found”
• 200: “OK”
• 301: “Moved Permanently”
• 500: “Internal Server Error”
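
The four codes on this slide can be looked up in Python's standard `http.HTTPStatus` enum. The sketch below also sorts a code into a coarse link state. The function names, the state labels, and the choice to count every 4xx and 5xx response as non-working are assumptions made for illustration; the slides do not define these buckets.

```python
from http import HTTPStatus


def describe(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase


def link_state(code):
    """Sort a status code into a coarse link state (hypothetical labels)."""
    if 200 <= code < 300:
        return "working"
    if 300 <= code < 400:
        return "redirected"
    if 400 <= code < 600:
        # Assumption: client and server errors both count as non-working,
        # even though a 5xx response may only be a temporary failure.
        return "non-working"
    return "other"


for code in (404, 200, 301, 500):
    print(code, describe(code), link_state(code))
```

A redirect is kept as its own state here because a link that redirects may still lead to the same website or to a different one.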

Page 11: Link Persistence, Website Persistence

automated link checker

“La Machine @ Yokohama” by Flickr user chidorian under CC BY-SA 2.0

Page 13: Link Persistence, Website Persistence

possible scenarios

• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists

Page 14: Link Persistence, Website Persistence

link works; same website

http://www.fair.org/ (2002)

http://www.fair.org/ (2013)

Page 15: Link Persistence, Website Persistence

link works; different website…

http://www.fb.com/ (2002)

http://www.fb.com/ (2013)

Page 16: Link Persistence, Website Persistence

…but website still exists

http://www.fb.org/ (2013)

Page 17: Link Persistence, Website Persistence

link doesn’t work…

http://www.state.mo.us/ (2002)

http://www.state.mo.us/ (2013)

Page 18: Link Persistence, Website Persistence

…but website still exists

http://www.sos.mo.gov/ (2013)

Page 19: Link Persistence, Website Persistence

link doesn’t work; website no longer exists

Page 20: Link Persistence, Website Persistence

assumptions

• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists

Page 21: Link Persistence, Website Persistence

research questions

• how much are we overestimating website persistence?
– some working links point to different websites
• how much are we underestimating website persistence?
– websites may still exist even though links don’t work or do work but point to different websites

Page 22: Link Persistence, Website Persistence

A Study Using

WEB ARCHIVES

Page 23: Link Persistence, Website Persistence

Library of Congress

U.S. Election 2002 Web Archive

Page 24: Link Persistence, Website Persistence

preparing the list of links

• exclude links corresponding to electoral candidate websites
• 1,071 links
– state government
– political parties
– advocacy organizations
– major newspapers
– political blogs

Page 25: Link Persistence, Website Persistence

methodology

automated
• run Heritrix against links, ignoring robots.txt
• log http response codes
• log redirects

manual
• manually check each link
• same website behind working link?
• does website still exist?
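
The automated half of this methodology (fetch each link, log the response code, log any redirects) can be sketched with Python's standard library. This is a stand-in for illustration only and is not Heritrix; `RedirectLogger` and `check_link` are hypothetical names. `urllib` never consults robots.txt, which roughly corresponds to the "ignoring robots.txt" setting.

```python
import urllib.error
import urllib.request


class RedirectLogger(urllib.request.HTTPRedirectHandler):
    """Redirect handler that records every hop as (status code, new URL)."""

    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def check_link(url, timeout=10):
    """Fetch one URL and return its final status code and redirect chain."""
    logger = RedirectLogger()
    opener = urllib.request.build_opener(logger)
    try:
        with opener.open(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code          # e.g. 404 or 500 on the final request
    except OSError:
        status = None              # DNS failure, refused connection, timeout
    final_url = logger.hops[-1][1] if logger.hops else url
    return {"url": url, "status": status,
            "final_url": final_url, "redirects": logger.hops}
```

Calling `check_link` on each of the study's links would give the raw log. Deciding whether a working link still leads to the same website remains the manual step.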

Page 27: Link Persistence, Website Persistence

working link?

• working: 91%
• non-working: 9%

Page 28: Link Persistence, Website Persistence

same website?

• working link; same site: 83%
• non-working link: 9%
• working link; different site: 8%

Page 29: Link Persistence, Website Persistence

non-working link; website still exists?

• working: 91%
• still exists: 8%
• doesn't exist: 2%

Page 30: Link Persistence, Website Persistence

website still exists?

• still exists: 94%
• doesn't exist: 6%

Page 31: Link Persistence, Website Persistence

summary of results

• how much are we overestimating website persistence?
– 8% of working links point to different websites
• how much are we underestimating website persistence?
– 82% of websites associated with non-working links still exist
– 48% of websites whose links now point to different websites still exist
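
As a rough cross-check, the percentages reported on the chart slides can be combined to see how the link-based figure and the website-based figure relate. The sketch below uses the rounded slide values and assumes that the 83% / 8% / 9% split is a share of all 1,071 links and that every website behind a working, same-site link counts as still existing; the variable names are illustrative.

```python
# Shares of all links, as shown on the "same website?" chart (rounded).
working_same_site = 0.83
working_different_site = 0.08
non_working = 0.09

# Survival rates from the summary of results.
exists_given_non_working = 0.82
exists_given_different_site = 0.48

# What a link checker alone would report as "persistent".
link_based_estimate = working_same_site + working_different_site

# Websites that still exist, whichever way their original link behaves.
website_persistence = (
    working_same_site
    + non_working * exists_given_non_working
    + working_different_site * exists_given_different_site
)

print(f"working links:        {link_based_estimate:.0%}")
print(f"websites still exist: {website_persistence:.0%}")
```

With these rounded inputs the second figure comes to about 94%, in line with the "website still exists?" chart, against 91% of links that simply work.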

Page 32: Link Persistence, Website Persistence

what does it mean?

• websites are (much more) persistent than links

• websites are surprisingly durable?

“Golden Spider Silk” by Flickr user amandabhslater under CC BY-SA 2.0

Page 33: Link Persistence, Website Persistence

Beyond Link Checking,

WEBSITE CHECKING?

“Check” by Flickr user ex.libris under CC BY-NC-ND 2.0

Page 34: Link Persistence, Website Persistence

building a website checker

1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
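
The three steps can be chained so that each link ends up in one of the scenarios listed earlier in the deck. This is a minimal sketch: `check_website` and its three callables are hypothetical, and the stub answers are hard-coded from the example slides rather than produced by live checks.

```python
def check_website(link, link_works, same_website, website_exists):
    """Run the three checks in order and name the resulting scenario.

    Each callable takes the link and returns True or False.
    """
    works = link_works(link)                    # step 1
    if works and same_website(link):            # step 2
        return "link works; same website"
    exists = website_exists(link)               # step 3
    fate = "website still exists" if exists else "website no longer exists"
    if works:
        return "link works; different website; " + fate
    return "link doesn't work; " + fate


# Stub answers taken from the example slides (not live checks).
WORKS = {"http://www.fair.org/": True, "http://www.fb.com/": True,
         "http://www.state.mo.us/": False}
SAME = {"http://www.fair.org/": True, "http://www.fb.com/": False}
EXISTS = {"http://www.fb.com/": True, "http://www.state.mo.us/": True}

for link in WORKS:
    print(link, "->", check_website(
        link,
        link_works=lambda u: WORKS[u],
        same_website=lambda u: SAME.get(u, False),
        website_exists=lambda u: EXISTS.get(u, False),
    ))
```

Passing the checks in as callables keeps the ordering logic separate from how each check is actually carried out, whether automated or manual.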

Page 35: Link Persistence, Website Persistence

“Most web archiving problems are problems of scale.”

(Kris Carpenter Negulescu, 2012)

“chutes and ladders” by Flickr user reallyboring under CC BY-NC-SA 2.0

Page 36: Link Persistence, Website Persistence

building a website checker

1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists

Page 38: Link Persistence, Website Persistence

…but checksums are limited

“Hashing Emily” by Flickr user wlef70 under CC BY-NC-SA 3.0
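
One limitation, presumably among those this slide has in mind, is that a checksum only says whether two captures are byte-for-byte identical. The sketch below compares two invented page snapshots that differ by a single character: their SHA-256 digests do not match, while a plain text-similarity ratio shows they are almost the same page. The page content and variable names are made up for illustration.

```python
import hashlib
from difflib import SequenceMatcher

# Two hypothetical captures of the same page; only the date changed.
old_page = "<html><body><h1>Election 2002</h1><p>Updated: 2002-11-05</p></body></html>"
new_page = "<html><body><h1>Election 2002</h1><p>Updated: 2002-11-06</p></body></html>"


def checksum(text):
    """Return the SHA-256 hex digest of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


same_checksum = checksum(old_page) == checksum(new_page)
similarity = SequenceMatcher(None, old_page, new_page).ratio()

print("checksums match:", same_checksum)
print("text similarity:", round(similarity, 3))
```

A checksum mismatch therefore cannot distinguish a trivially updated page from an entirely different website, which is why a finer-grained comparison is needed for step 2.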

Page 39: Link Persistence, Website Persistence

visual analysis of page changes

Pehlivan, Ben-Saad, and Gançarski: “Vi-DIFF: Understanding Web Pages Changes”

Page 40: Link Persistence, Website Persistence

building a website checker

1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists

Page 41: Link Persistence, Website Persistence

lexical signature of archived page

Ware, Klein, and Nelson: “An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages”
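
A lexical signature is commonly built as the handful of terms that best characterize a page, often the top-scoring terms under TF-IDF, which can then be used as a search query to look for the page at a new location. The sketch below shows that generic idea only; it is not the method evaluated in the cited paper, and the function names, the smoothing formula, and the tiny reference corpus are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF score."""
    term_counts = Counter(tokenize(page))
    docs = [set(tokenize(doc)) for doc in corpus]
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(term in doc for doc in docs)
        # Smoothed inverse document frequency (one common variant).
        idf = math.log((1 + len(docs)) / (1 + doc_freq)) + 1
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


corpus = [
    "the election results for the state",
    "the weather in the state",
    "the senate election campaign finance reform",
]
archived_page = "campaign finance reform and the campaign finance debate"
print(lexical_signature(archived_page, corpus, k=2))
```

In practice the document frequencies would come from a much larger collection, and stop words would usually be filtered out before scoring.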

Page 42: Link Persistence, Website Persistence

find archived pages w/ Memento

• http protocol enhancement

• enables discovery of archived resources in distributed web archives
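
In the Memento protocol (RFC 7089), a client asks a TimeGate for the version of a resource closest to a chosen date by sending an `Accept-Datetime` request header. The sketch below only builds such a request and sends nothing over the network. The TimeGate address is a placeholder, and appending the original URL to a TimeGate prefix is an assumed URL pattern, not something the slides specify.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.request


def memento_request(timegate, original_url, when):
    """Build a TimeGate request for `original_url` as of datetime `when`."""
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return urllib.request.Request(
        timegate + original_url,
        headers={"Accept-Datetime": accept_datetime},
    )


req = memento_request(
    "http://timegate.example.org/timegate/",   # hypothetical TimeGate
    "http://www.state.mo.us/",                 # link from the example slides
    datetime(2002, 11, 5, tzinfo=timezone.utc),
)
print(req.full_url)
print(req.header_items())
```

A real TimeGate would typically answer with a redirect to an archived copy, whose response carries a `Memento-Datetime` header stating when it was captured.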

Page 43: Link Persistence, Website Persistence

lexical signatures of backlink pages

Page 44: Link Persistence, Website Persistence

“The future is already here; it’s just not very evenly distributed.”

(William Gibson, 1999)

“Time Travel” by Flickr user xcalibr under CC BY-NC-ND 2.0