Negotiating the archives of UK web space
Jane Winters, Professor of Digital Humanities, School of Advanced Study, University of London
Workshop on National Webs, Aarhus, 8-9 December 2016
Jisc Domain Dataset (1996-2013)
Legal Deposit Domain Crawl (2013-2016)
Open UKWA (2004-2016)
UK Parliament Web Archive (2009-2016)
UK Government Web Archive (1996-2016)
Internet Archive (1996-2016)
Common Crawl (1999-2016)
Archive-It
Other national web archives
Facts and figures I
• Jisc historical dataset 1996 to 6 April 2013
– 3,520,628,647 distinct records
– 65 terabytes
• 2014 domain crawl (.uk)
– 56TB data
– 2.5 billion webpages and other assets (including 4.7GB of viruses)
Facts and figures II
• UK Parliament Web Archive
– Three snapshots per year covering 30 sites (37 sites in the archive in total)
– 4.8TB data
• UK Government Web Archive
– 3,000+ websites
– Twitter (65,000 tweets) and video archives
Internal inconsistencies
• UKGWA consists of data provided by the Internet Archive (IA) for 2003-4 (plus a back catalogue to 1996) and by the Internet Memory Foundation from 2005 onwards (further complicated by membership of UKWAC)
• The BL annual domain crawl has failed differently each time it has run
• The ‘break’ between IA and nationally archived content
[Figure: Number of format types, 1996-1997. Bar chart comparing 1996 and 1997 across text, image, application, video and file types; y-axis from 0 to 300.]
nexbri.demon.co.uk/local.gif 19970823153342 http://nexbri.demon.co.uk:80/local.gif image/gif 200 DFBOHMHZPPQSEAIGZGL5MTATRKVB3FGF - 40806909 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
mirex.demon.co.uk/background3.gif 19970824013134 http://mirex.demon.co.uk:80/background3.gif image/* 200 Z2V3V4NZTEYL634PR4VPS7YWIVG7J4B4 - 40832067 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
mirex.demon.co.uk/mirex.gif 19970824013153 http://mirex.demon.co.uk:80/mirex.gif image/* 200 KVZHDCQIPPU4T5TA6P4TCP2BAAJNSH6H - 40840076 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
mirex.demon.co.uk/ibrowsenowanim.gif 19970824013315 http://mirex.demon.co.uk:80/IBrowseNowAnim.gif image/* 200 CQXESYZG2DMVYDISQDJVQCMDJAHD7YEK - 40860957 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
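The four lines above appear to be CDX index records from the historical dataset. As a minimal sketch, assuming the common nine-field CDX layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, offset, file name), the records could be parsed and the number of distinct format types counted per year along the following lines. The field names, the `CdxRecord` type and both function names are illustrative assumptions, not part of the dataset's documentation.

```python
from collections import defaultdict
from typing import NamedTuple


class CdxRecord(NamedTuple):
    # Field names are assumed from the common nine-field CDX layout.
    urlkey: str
    timestamp: str      # YYYYMMDDhhmmss
    original_url: str
    mime_type: str
    status: str
    digest: str
    redirect: str
    offset: int
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one whitespace-separated CDX line into named fields."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    urlkey, ts, url, mime, status, digest, redirect, offset, filename = parts
    return CdxRecord(urlkey, ts, url, mime, status, digest, redirect,
                     int(offset), filename)


def format_types_by_year(records):
    """Count distinct MIME types per (year, top-level type) pair."""
    seen = defaultdict(set)
    for rec in records:
        year = rec.timestamp[:4]
        top_level = rec.mime_type.split("/", 1)[0]
        seen[(year, top_level)].add(rec.mime_type)
    return {key: len(types) for key, types in seen.items()}


# Two of the records shown above, copied verbatim.
SAMPLE = """\
nexbri.demon.co.uk/local.gif 19970823153342 http://nexbri.demon.co.uk:80/local.gif image/gif 200 DFBOHMHZPPQSEAIGZGL5MTATRKVB3FGF - 40806909 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
mirex.demon.co.uk/background3.gif 19970824013134 http://mirex.demon.co.uk:80/background3.gif image/* 200 Z2V3V4NZTEYL634PR4VPS7YWIVG7J4B4 - 40832067 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
"""

records = [parse_cdx_line(line) for line in SAMPLE.splitlines() if line.strip()]
counts = format_types_by_year(records)
print(counts)
```

Note that `image/gif` and `image/*` are counted as two separate format types here, even though all the files are GIFs. Whether wildcard values such as `image/*` should be normalised before counting is a choice the researcher has to make.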
[Figure: Number of .uk names registered, 1996-2008 (Nominet). Chart of yearly values from 1996 to 2008; y-axis from 0 to 1,800,000.]
Breakdown of domain name registrations, 2000 (Nominet)
• .co.uk: 1,575,655
• .org.uk: 108,711
• .ltd.uk: 4,626
• .plc.uk: 265
• .net.uk: 42
• .sch.uk: 8,830
Acknowledgements
• BUDDAH project team – Jonathan Blaney, Niels Brügger, Josh Cowls, Helen Hockx-Yu, Andrew Jackson, Eric Meyer, Ralph Schroeder, Jason Webber, Peter Webster
• Bursary holders – Rowan Aust, Rona Cran, Richard Deswarte, Saskia Huc-Hepher, Alison Kay, Gareth Millward, Marta Musso, Harry Raffal, Lorna Richardson, Helen Taylor