Post on 05-Jun-2020

Web archiving in the UK: why, by whom, for whom?

Dr Peter Webster

Webster Research and Consulting

@pj_webster / @WebsterRandC

The web its own archive?

Open UK Web Archive 2004-13 comparison. @anjacks0n

http://britishlibrary.typepad.co.uk/webarchive/2014/10/what-is-still-on-the-web-after-10-years-of-archiving-.html

Disappearing predictably

Disappearing unpredictably

What is web archiving?

“deliberate and purposive preservation of web material” (Brügger, 2011)

• micro or macro

• element, page, website, web sphere, whole

• harvesting, screen capture, file delivery

• public, restricted, or no access

[ archive.org ]

National libraries

• 16 of 28 member states within EU

• Sweden the first (1996)

• US (Library of Congress), Canada, Australia, New Zealand, Singapore, Japan, Chile

• some with legal deposit provision: Denmark (2005), France (2006), UK (2013)

Legal deposit web archiving: characteristics

• broad domain crawl, plus selective

• definition of the nation varies

• types of content included varies

• access restrictions

• indemnity against legal risks

Selective harvesting

• in the absence of non-print legal deposit (NPLD), based on permissions

• part of the case for obtaining NPLD law

• key resources, e.g. government, media

• events: elections, Olympics, Eurovision

• themes: political extremism, climate change

Why archive your own web?

• part of orderly management of closure

• fulfilment of legal obligation

• management of risk

• part of the corporate record

• as a service for future scholars

Government records

A lost archive?

Web archives in the UK

Archive | Temporal scope | Content scope | Access

Open UKWA | 2004-present | Selective | Online

Legal Deposit UKWA | 2013-present | Comprehensive (for UK) | Onsite

JISC UK Domain Dataset | 1996-2013 | Comprehensive (for .uk) | Index only

UK Government Web Archive | 1996-present | UK government | Online

Parliamentary Web Archive | 2009-present | UK parliament | Online

Univ. of Oxford | 2011-present | University sites | Online

Tricky areas

• IPR (including third parties)

• personal data

• the right to be forgotten

• database-driven content

• embedded streaming media

Outsourcing providers

Not-for-profit

• Archive-It (part of the Internet Archive)

• Internet Memory Research

Commercial

• Hanzo Archives [UK]

• OIA (Offline Web Archive) [Germany]

• Pagefreezer [Canada/Netherlands]

Ways to use the archived web

• URL search -> single page

• Full-text search -> single page

• Visualisation -> trend -> page

Changing aesthetic

gov.ie, captured by archive.org, 15 August 2000

Full-text search

webarchive.org.uk/shine - https://github.com/ukwa/shine/

Visualising trends: ngram
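A trend view of this kind can be approximated by counting, for each year, the share of archived documents that mention a phrase. The sketch below is only an illustration under stated assumptions: the function name `yearly_trend` and the sample documents are hypothetical, it uses simple substring matching rather than tokenised n-grams, and it is not the method used by the Shine interface.

```python
from collections import Counter

def yearly_trend(docs, phrase):
    """Return {year: share of that year's documents containing phrase}."""
    totals = Counter()  # documents seen per year
    hits = Counter()    # documents per year that mention the phrase
    needle = phrase.lower()
    for year, text in docs:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Made-up sample documents, for illustration only.
sample_docs = [
    (2005, "Flood defences reviewed"),
    (2009, "Climate change summit opens"),
    (2009, "Transport plans published"),
]
trend = yearly_trend(sample_docs, "climate change")
```

Dividing by the yearly total matters because the number of captured pages differs greatly from year to year, so raw counts alone would mostly reflect the size of the crawl.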

Ways to use the archived web

• URL search -> single page

• Full-text search -> single page

• Visualisation -> trend -> page

• Direct access to WARC

• Derived datasets

• API access

Derived datasets from the BL

From JISC UK Web Domain Dataset (1996-2010)

• File format profile

• Geo-index

• Crawled URL Index (CDX)

• Host Link Graph

Public domain at data.webarchive.org.uk

[ http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/fmt ]

[Wikimedia Commons, CC BY SA 2.0, by Brian (of Toronto)]

A media firestorm

[https://web.archive.org/web/20080211003812/http://www.newsoftheworld.co.uk/1002_sharia.shtml]
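A capture URL of this form carries both the capture time and the original address, so it can be taken apart programmatically. The following is a minimal sketch, assuming the full 14-digit `YYYYMMDDhhmmss` timestamp seen in the URL above; the function name `parse_wayback_url` is hypothetical, and shorter timestamps or timestamps with modifiers are not handled.

```python
from datetime import datetime

PREFIX = "https://web.archive.org/web/"

def parse_wayback_url(url):
    """Split a capture URL into (capture time, original URL)."""
    if not url.startswith(PREFIX):
        raise ValueError("not a web.archive.org capture URL")
    # Everything up to the first slash after the prefix is the timestamp;
    # the remainder is the address of the page as originally published.
    timestamp, _, original = url[len(PREFIX):].partition("/")
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return captured, original

captured, original = parse_wayback_url(
    "https://web.archive.org/web/20080211003812/"
    "http://www.newsoftheworld.co.uk/1002_sharia.shtml"
)
print(captured.isoformat(), original)
```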

UK Host Link Graph (1996-2010)

2008 | newsimg.bbc.co.uk | youtube.com | 45

2008 | archbishopofyork.org.uk | flickr.com | 1

2002 | secularism.org.uk | geocities.com | 1

Public domain at: data.webarchive.org.uk
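Rows like the three shown above can be loaded and aggregated with a few lines of code. This sketch assumes the four fields are year, linking host, linked-to host and number of links, which is an interpretation of the sample rows rather than a documented schema; the names `parse_rows` and `links_per_year` are hypothetical.

```python
from collections import defaultdict

# The three sample rows shown on the slide.
ROWS = """\
2008 | newsimg.bbc.co.uk | youtube.com | 45
2008 | archbishopofyork.org.uk | flickr.com | 1
2002 | secularism.org.uk | geocities.com | 1
"""

def parse_rows(text):
    """Yield (year, source_host, target_host, count) from pipe-separated lines."""
    for line in text.splitlines():
        if not line.strip():
            continue
        year, source, target, count = (field.strip() for field in line.split("|"))
        yield int(year), source, target, int(count)

def links_per_year(rows):
    """Sum the link counts for each year (assumed meaning of the last field)."""
    totals = defaultdict(int)
    for year, _source, _target, count in rows:
        totals[year] += count
    return dict(totals)

totals = links_per_year(parse_rows(ROWS))
```

The same parsed rows could be filtered by target host instead, for example to follow how links to a single site rise and fall over time.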

[https://web.archive.org/web/20080211003812/http://www.newsoftheworld.co.uk/1002_sharia.shtml]

Questions?

Peter Webster

peter@websterresearchconsulting.com

@pj_webster / @WebsterRandC

websterresearchconsulting.com