The web is a mess: how I learnt to stop worrying and love web archiving. Kristine Hanna

Upload: biblioteca-nacional-de-espana

Posted on 15-Jan-2015

DESCRIPTION

Presented at the International Conference on Web Archives and Electronic Legal Deposit, held at the Biblioteca Nacional de España (BNE) on 9 July 2013.

TRANSCRIPT

Page 1: The web is a mess: how I learnt to stop worrying and love web archiving. Kristine Hanna

The Web is a Mess

How I learned to stop worrying and love web archiving

Page 2:

We are a Digital Library

Mission Statement: Universal access to all knowledge

o Founded by Brewster Kahle in San Francisco, California in 1996

o Officially designated a Library by the State of California in 2007

About Internet Archive

Page 3:
Page 4:

500,000

Books

Page 5:

500,000

500,000

Books

Moving Images

Page 6:

http://flickr.com/photos/marfis75/

500,000

500,000

1,000,000

Books

Moving Images

Audio Recordings

Page 7:

500,000

500,000

1,000,000

2,000,000

Books

Moving Images

Audio Recordings

Hours of TV

Page 8:

500,000

500,000

1,000,000

2,000,000

3,600,000

Books

Moving Images

Audio Recordings

Hours of TV

eBooks

Page 9:

The Archive is accessible to the public via the website: www.archive.org

o Started collecting content in 1996

o First web pages publicly available in 2001

o 347+ billion web pages

o 200+ million websites

o Almost every domain

o Content in 140+ languages

o Collect a broad summary of the web every 30-60 days - approximately 10 billion pages per snapshot

Access to General Web Archive
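
As a minimal sketch of what public access can look like in practice, the snippet below builds two URLs for a page and a date. It assumes the public Wayback Machine replay pattern (`web.archive.org/web/<timestamp>/<url>`) and the `archive.org/wayback/available` lookup endpoint, neither of which is named on the slide; `example.com` and the function names are placeholders. Nothing is fetched, the code only constructs the addresses.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed public endpoints (not named on the slide).
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def _timestamp(when: datetime) -> str:
    """Format a date as the 14-digit YYYYMMDDhhmmss string used in replay URLs."""
    return when.strftime("%Y%m%d%H%M%S")


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a replay URL asking for the capture closest to `when`."""
    return f"{WAYBACK_PREFIX}/{_timestamp(when)}/{original_url}"


def availability_query(original_url: str, when: datetime) -> str:
    """Build a lookup URL asking whether a capture exists near `when`."""
    params = {"url": original_url, "timestamp": _timestamp(when)}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    day_of_talk = datetime(2013, 7, 9)
    print(wayback_url("http://example.com/", day_of_talk))
    print(availability_query("http://example.com/", day_of_talk))
```

The timestamp is a request, not a guarantee: the archive serves whichever capture is nearest to it.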

Page 10:

What is Web Archiving?

Web archiving is the process of collecting portions of web content, preserving the collections, and then providing access to the archives for use and reuse.

Page 11:

A web archive is a collection of archived URLs grouped by theme, event, subject area, or web address.

A web archive contains as much as possible from the original resources and documents the change over time. It is a priority to recreate the same experience a user would have had if they had visited the live site on the day it was archived.

What is a Web Archive?

Page 12:

Who is archiving the web?

Page 13:

Why are We Doing This?

• Web archives preserve the web. They act as the web equivalent of the archive or library. In this role, their mission is to acquire and preserve the web, ensuring its continued survival for future generations.

• Billions of people around the world have grown accustomed to using the web as their primary resource to acquire information.

• The availability of this electronic information is taken for granted, but it is a fallacy that anything published on the web will be there forever.

• There’s an essential need for people to understand that the web represents who we are. It’s our culture and our social fabric, and we don’t want to lose it.

Page 14:

Why should we archive the web?

Page 15:

• A 1997 report in Scientific American claims 44 days.

• A subsequent 2001 academic study in IEEE suggests 75 days.

• A 2003 Washington Post article indicates the number is 100 days.

• A 2013 study by Old Dominion University says that after the first year of publishing, nearly 11% of social media will be lost, and after that we will continue to lose 0.02% per day.

How long does a website live?
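
The Old Dominion figures can be turned into a rough back-of-the-envelope estimate. The sketch below assumes a simple linear reading of the slide: 11% lost at the end of the first year, then a further 0.02 percentage points per day. The function name is hypothetical, and the slide gives no figure for how loss builds up inside the first year, so the function declines to guess for that period.

```python
FIRST_YEAR_LOSS = 0.11    # "nearly 11%" lost after the first year (from the slide)
DAILY_LOSS_AFTER = 0.0002 # 0.02% per day thereafter (from the slide)
DAYS_IN_YEAR = 365


def estimated_loss(days_since_publication: int) -> float:
    """Estimate the fraction of shared content lost, assuming linear loss after year one."""
    if days_since_publication < DAYS_IN_YEAR:
        # The slide only reports the end-of-year figure, so earlier values are not modelled.
        raise ValueError("no estimate available for the first year")
    extra_days = days_since_publication - DAYS_IN_YEAR
    # Cap at 1.0: more than everything cannot be lost.
    return min(FIRST_YEAR_LOSS + DAILY_LOSS_AFTER * extra_days, 1.0)


if __name__ == "__main__":
    for years in (1, 2, 5):
        share = estimated_loss(years * DAYS_IN_YEAR)
        print(f"after {years} year(s): about {share:.1%} lost")
```

Under this reading, roughly 18% of shared content would be gone after two years (11% plus 365 days at 0.02%).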

Page 16:

• Create a thematic/topical web archive on a specific subject

• Capture ‘at risk’ content during a spontaneous event

• Fulfill organizational mandate to preserve institutional memory & history

• Archive state/local agency publications no longer deposited in print form

• Archive records to meet university and/or government retention policies.

• Collect content to act as a research service for scholars to turn to

• Capture social media sites as part of organizational records

• Collect web-based information to augment physical holdings.

• Archive online art ephemera

• End of Life/Closure

Web Archiving Use Cases

Page 17:

How does web archiving work?

Page 18:

What is a crawler?

A crawler is the software that captures and archives web pages. A crawler visits a page and indexes the content included therein

Page 19:

Some technical challenges in capturing content

• Technical: dynamic content relies on scripting technologies such as Flash and JavaScript. The web is a hodgepodge of technologies, some old and outdated, others at the cutting edge.

• Capturing social media sites has become necessary as the web moves away from HTML and towards applications.

• Explore other capture mechanisms besides using a traditional crawler resource: hybrid architecture/API/headless browsers

Page 20:

http://www.chaitalag.com/new/s/tubig http://www.helenbrowngroup.com/2011/02/rescue-from-the-digital-firehose/gushing-firehose-by-joseph-robertson/

Amount of content that is being archived

Amount of data being created by content providers

Challenge: a lot of data

Page 21:

Challenge: How much to archive?

There are limits…

Page 22:

Challenge: What to archive?

…What is important to you? What do you want people to know about? What are your organization’s collecting activities? Vision?

Page 23:

Participant Poll

• Does any of this make any sense?

Page 24:

Managing Collections

Page 25:

Starting a Collection

Collection: A group of URLs crawled and organized around a common theme, topic or domain

Ask Yourself:

• What is the topic of this collection?

• What websites would you like to archive as part of this collection?

Page 26:

Collections Start with Seeds

• Seed: starting point URL for the crawler. The crawler will follow linked pages from your seed URL and archive them if they are ‘in scope’.

• Document: any file with a distinct URL (html, image, PDF, video, etc).
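
The relationship between seeds, scope and documents can be illustrated with a small sketch. This is not the Internet Archive's crawler: it walks a hypothetical in-memory link graph instead of fetching pages, and it assumes one simple scope rule (a URL is 'in scope' when it is on the same host as the seed). All URLs and function names are invented for illustration.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for the live web: URL -> URLs it links to.
LINKS = {
    "http://example.org/": [
        "http://example.org/about",
        "http://example.org/report.pdf",
        "http://other.example.com/",
    ],
    "http://example.org/about": ["http://example.org/", "http://example.org/logo.png"],
    "http://example.org/report.pdf": [],
    "http://example.org/logo.png": [],
    "http://other.example.com/": ["http://other.example.com/page"],
}


def in_scope(url: str, seed: str) -> bool:
    """Assumed scope rule: keep only URLs on the same host as the seed."""
    return urlparse(url).hostname == urlparse(seed).hostname


def crawl(seed: str, links: dict) -> list:
    """Follow links breadth-first from the seed and return the archived documents in order."""
    archived = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        archived.append(url)  # every distinct URL counts as one document
        for target in links.get(url, []):
            if target not in seen and in_scope(target, seed):
                seen.add(target)
                queue.append(target)
    return archived


if __name__ == "__main__":
    for document in crawl("http://example.org/", LINKS):
        print(document)
```

Starting from the single seed, the sketch archives four documents (two HTML pages, a PDF and an image) and skips the link to the other host because it falls outside the assumed scope.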

Page 27:

Some of our Partners’ Digital Collections

• Stanford University (Palo Alto, California)

• American University in Cairo

• Biblioteca Nacional de España

Page 28:

Stanford University, Islamic & Middle Eastern Collection

Use Case: harvest and preserve Iranian Blogs

• Archiving over 300 blogs written by and for Iran and the Iranian people

• Includes coverage of 2009 Iranian elections and the current Middle East unrest

Page 29:

Stanford and New York Universities Islamic and Middle Eastern Collection

Page 30:
Page 31:

American University in Cairo

Use Case: The American University in Cairo Web Archive collects, preserves, and provides access to the web content published by students, faculty, departments, and offices at AUC. The archive also collects Web documents that have long-term research or historical value.

Page 32:

January 25th Revolution and University on the Square Demonstrators in Tahrir Square. Image courtesy of Ahmad and the American University in Cairo Rare Books and Special Collections Library.

Page 33:

Archivist Driven Captures

Thank you to Egypt's youth and Facebook. Image courtesy of Martin and Amy Rowe and the American University in Cairo Rare Books and Special Collections Library.

Page 34:

Patron Driven Captures

Screenshot of the University on the Square contribution form. In addition to soliciting photos and videos, we asked content providers to submit websites, blogs, Twitter feeds, etc.

Page 35:

Archivist as Advocate

Protester documenting the demonstrations in Tahrir Square. Image courtesy of Robeir Rasmy and the American University in Cairo Rare Books and Special Collections Library.

Page 36:

Breaking down the life cycle

• One of its top priorities as a memory institution is to consolidate the strategies that lead to the complete preservation of content published on the Spanish Internet, in accordance with the library's mission as keeper and disseminator of Spanish culture.

• Commitment to its patrons, who expect the web archive to become a publicly and freely accessible key information source for the study of the 21st century.

Biblioteca Nacional de España

Page 37:

Breaking down the life cycle

Use cases:

• 2011 Election crawl

• 2012 Humanities crawl

• 2009-present .es domain crawls

• 2013 .es Broad Survey Crawl, which visited the top-level page of every website registered under .es (in partnership with Red.es)

• 2011-2013 Thematic curation (World Cups, Olympics, Global Hunger)

Biblioteca Nacional de España

Page 38:

http://www.udatleticoisleño.es

Page 39:

http://www.facebook.com/eajpnv

Page 40:

http://twitter.com/xalmar

• Archived web page from Facebook and/or Flickr

Page 41:

http://es.wikipedia.org/wiki/Partido_Pirata_(España)

Page 42:

http://www.estrelladigital.es

Page 43:

http://leer.es

Page 44:

http://iuabierta.blogspot.com

Page 45:

Not available on the live web

Page 46:

http://www.piratamadrid.es

Page 47:

Not available on the live web

Page 48:

Making sense of it all

• Web Archiving Life Cycle Model

• Internet Archive future objectives

– Social Media

– Distributed Content

– Visualization and analytical tools for more useful interaction

– Search

– Mobile platforms

– Enhanced Researcher Access

Page 49:

Web Archiving Life Cycle Model

Web Archiving Life Cycle Model white paper available: http://www.archive-it.org/publications

Page 50:

Outer layer:

• Vision and Objectives

• Resources and Workflow

• Access / Use / Reuse

• Preservation

• Risk Management

Inner circle:

• Appraisal and Selection

• Scoping

• Data Capture

• Storage and Organization

• Quality Assurance and Analysis

Breaking down the life cycle

Page 51:

Participant Poll

• Are you confused yet?

I hope not. Happy to answer questions!

Page 52:

The importance of web archiving

“As our digital world continues to grow at a breathtaking pace and more and more of our daily life occurs within its digital boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations”

Kalev H. Leetaru, University of Illinois

Page 53:

Kristine Hanna, Director, Archiving Services

Internet Archive

Thank you!

Page 54: