The Web is a Mess: How I Learnt to Stop Worrying and Love Web Archiving. Kristine Hanna

The Web is a Mess

How I learned to stop worrying and love web archiving

We are a Digital Library

Mission Statement: Universal access to all knowledge

o Founded by Brewster Kahle in San Francisco, California, in 1996

o Officially designated a Library by the State of California in 2007

About Internet Archive

o 500,000 Books

o 500,000 Moving Images

o 1,000,000 Audio Recordings

o 2,000,000 Hours of TV

o 3,600,000 eBooks

Image credit: http://flickr.com/photos/marfis75/

The Archive is accessible to the public via the website: www.archive.org

o Started collecting content in 1996

o First web pages publicly available in 2001

o 347+ billion web pages

o 200+ million websites

o Almost every domain

o Content in 140+ languages

o Collect a broad summary of the web every 30-60 days - approximately 10 billion pages per snapshot

Access to General Web Archive

http://www.archive.org

What is Web Archiving?

Web archiving is the process of collecting portions of web content, preserving the collections, and then providing access to the archives for use and reuse.

A web archive is a collection of archived URLs grouped by theme, event, subject area, or web address.

A web archive contains as much as possible from the original resources and documents the change over time. It is a priority to recreate the same experience a user would have had if they had visited the live site on the day it was archived.

What is a Web Archive?

Who is archiving the web?

Why are We Doing This?

• Web archives preserve the web. They act as the web equivalent of the archive or library. In this role, their mission is to acquire and preserve the web, ensuring its continued survival for future generations.

• Billions of people around the world have grown accustomed to using the web as their primary resource to acquire information.

• The availability of this electronic information is taken for granted, but it is a fallacy that if something is on the web it will be there forever.

• There’s an essential need for people to understand that the web represents who we are. It’s our culture and our social fabric, and we don’t want to lose it.

Why should we archive the web?

How long does a website live?

• A 1997 report in Scientific American claims 44 days.

• A subsequent 2001 academic study in IEEE suggests 75 days.

• A 2003 Washington Post article indicates the number is 100 days.

• A 2013 study by Old Dominion University says that after the first year of publishing, nearly 11% of social media will be lost and after that we will continue to lose 0.02% per day
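
The Old Dominion figures can be turned into a rough estimate. The sketch below is only an illustration: it assumes the loss is linear (11% spread evenly over the first year, then 0.02 percentage points of the original content per day), and the function name and parameters are made up for this example rather than taken from the study.

```python
def fraction_lost(days_since_publication: int,
                  first_year_loss: float = 0.11,
                  daily_loss_after: float = 0.0002) -> float:
    """Estimate the share of shared content that is gone after a number of days.

    Assumption: losses accumulate linearly, both within the first year
    and afterwards. The slide only gives the two headline rates.
    """
    if days_since_publication < 0:
        raise ValueError("days_since_publication must be non-negative")
    if days_since_publication <= 365:
        # Spread the first-year loss evenly over 365 days (assumed).
        return first_year_loss * days_since_publication / 365
    extra_days = days_since_publication - 365
    # After year one, add 0.02% of the original content per day, capped at 100%.
    return min(first_year_loss + daily_loss_after * extra_days, 1.0)


if __name__ == "__main__":
    for years in (1, 2, 5):
        share = fraction_lost(365 * years)
        print(f"after {years} year(s): about {share:.1%} lost")
```

Under these assumptions, roughly 18% of the content would be gone after two years, which is why capture has to happen close to the time of publication.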


• Create a thematic/topical web archive on a specific subject

• Capture ‘at risk’ content during a spontaneous event

• Fulfill organizational mandate to preserve institutional memory & history

• Archive state/local agency publications no longer deposited in print form

• Archive records to meet university and/or government retention policies.

• Collect content to act as a research service for scholars to turn to

• Capture social media sites as part of organizational records

• Collect web-based information to augment physical holdings.

• Archive online art ephemera

• Capture a website before its end of life or closure

Web Archiving Use Cases

How does web archiving work?

What is a crawler?

A crawler is the software that captures and archives web pages. A crawler visits a page and indexes the content it contains.

Some technical challenges in capturing content

• Technical: dynamic content relies on scripting technologies such as Flash and JavaScript. The web is a hodgepodge of technologies, some old and outdated, others at the cutting edge.

• Capturing social media sites has become necessary as the web moves away from HTML and towards applications.

• Explore capture mechanisms beyond the traditional crawler: hybrid architectures, APIs, headless browsers.

http://www.chaitalag.com/new/s/tubig

http://www.helenbrowngroup.com/2011/02/rescue-from-the-digital-firehose/gushing-firehose-by-joseph-robertson/

Amount of content that is being archived

Amount of data being created by content providers

Challenge: a lot of data

Challenge: How much to archive?

There Are Limits…

Challenge: What to archive?

…What is important to you? What do you want people to know about? What are your organization’s collecting activities? Vision?

Participant Poll

• Does any of this make any sense?

Managing Collections

Starting a Collection

Collection: A group of URLs crawled and organized around a common theme, topic or domain

Ask Yourself:

• What is the topic of this collection?

• What websites would you like to archive as part of this collection?

Collections Start with Seeds

• Seed: starting point URL for the crawler. The crawler will follow linked pages from your seed URL and archive them if they are ‘in scope’.

• Document: any file with a distinct URL (HTML, image, PDF, video, etc.).
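
The seed and scope idea can be sketched in a few lines. This is a toy illustration, not how any production crawler works: the link graph `TOY_WEB`, the `in_scope` rule (same host as the seed) and the function names are all assumptions made for this example, and nothing is fetched from the network.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical stand-in for the live web: each URL maps to the URLs it links to.
TOY_WEB = {
    "http://example.org/": [
        "http://example.org/about",
        "http://example.org/report.pdf",
        "http://other.example.com/",
    ],
    "http://example.org/about": ["http://example.org/", "http://example.org/team"],
    "http://example.org/report.pdf": [],
    "http://example.org/team": [],
    "http://other.example.com/": ["http://other.example.com/page"],
    "http://other.example.com/page": [],
}


def in_scope(url: str, seed: str) -> bool:
    """Assumed scoping rule: a URL is in scope if it shares the seed's host."""
    return urlparse(url).netloc == urlparse(seed).netloc


def crawl(seed: str, web: dict, max_documents: int = 100) -> list:
    """Follow links breadth-first from the seed; return documents in visit order."""
    archived = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(archived) < max_documents:
        url = queue.popleft()
        archived.append(url)  # a real crawler would fetch and store the file here
        for link in web.get(url, []):
            if link not in seen and in_scope(link, seed):
                seen.add(link)
                queue.append(link)
    return archived


if __name__ == "__main__":
    for document in crawl("http://example.org/", TOY_WEB):
        print(document)
```

Starting from the seed `http://example.org/`, the sketch collects four documents (the home page, the about page, the PDF and the team page) and skips `other.example.com` because it is out of scope. Each collected URL counts as one document, whatever its file type.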

Some of Our Partners’ Digital Collections

• Stanford University (Palo Alto, California)

• American University in Cairo

• Biblioteca Nacional de España

Stanford University, Islamic & Middle Eastern Collection

Use Case: harvest and preserve Iranian Blogs

• Archiving over 300 blogs written by and for Iran and the Iranian people

• Includes coverage of 2009 Iranian elections and the current Middle East unrest

Stanford and New York Universities Islamic and Middle Eastern Collection

American University in Cairo

Use Case: The American University in Cairo Web Archive collects, preserves, and provides access to the web content published by students, faculty, departments, and offices at AUC. The archive also collects Web documents that have long-term research or historical value.

January 25th Revolution and University on the Square

Demonstrators in Tahrir Square. Image courtesy of Ahmad and the American University in Cairo Rare Books and Special Collections Library.

Archivist Driven Captures

Thank you to Egypt's youth and Facebook. Image courtesy of Martin and Amy Rowe and the American University in Cairo Rare Books and Special Collections Library.

Patron Driven Captures

Screenshot of the University on the Square contribution form. In addition to soliciting photos and videos, we asked content providers to suggest websites, blogs, Twitter feeds, etc.

Archivist as Advocate

Protester documenting the demonstrations in Tahrir Square. Image courtesy of Robeir Rasmy and the American University in Cairo Rare Books and Special Collections Library.


• One of the library's top priorities as a memory institution is to consolidate the strategies that lead to the complete preservation of content published on the Spanish Internet, in accordance with its mission as keeper and disseminator of Spanish culture.

• Commitment to its patrons, who expect the web archive to become a publicly and freely accessible key information source for the study of the 21st century.

Biblioteca Nacional de España


Use cases:

• 2011 Election crawl

• 2012 Humanities crawl

• 2009-present .es domain crawls

• 2013 .es Broad Survey Crawl, which visited the top-level page of every website registered under .es (in partnership with Red.es)

• 2011-2013 Thematic curation (World Cups, Olympics, Global Hunger)

Biblioteca Nacional de España

http://www.udatleticoisleño.es

http://www.facebook.com/eajpnv

http://twitter.com/xalmar

• Archived web page from Facebook and/or Flickr

http://es.wikipedia.org/wiki/Partido_Pirata_(España)

http://www.estrelladigital.es

http://leer.es

http://iuabierta.blogspot.com

Not available on the live web

http://www.piratamadrid.es

Not available on the live web

Making sense of it all

• Web Archiving Life Cycle Model

• Internet Archive future objectives

– Social Media

– Distributed Content

– Visualization and analytical tools for more useful interaction

– Search

– Mobile platforms

– Enhanced Researcher Access

Web Archiving Life Cycle Model

Web Archiving Life Cycle Model white paper available: http://www.archive-it.org/publications




Breaking down the life cycle

Outer layer:

• Vision and Objectives

• Resources and Workflow

• Access / Use / Reuse

• Preservation

• Risk Management

Inner circle:

• Appraisal and Selection

• Scoping

• Data Capture

• Storage and Organization

• Quality Assurance and Analysis


Participant Poll

• Are you confused yet?

I hope not. Happy to answer questions!

The importance of web archiving

“As our digital world continues to grow at a breathtaking pace and more and more of our daily lives occur within its digital boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations”

Kalev H. Leetaru, University of Illinois

Kristine Hanna, Director, Archiving Services

Internet Archive

Thank you!
