![Page 1: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/1.jpg)
1
Archiving and Preserving the Web
Kristine Hanna
Internet Archive
April 2006
![Page 2: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/2.jpg)
2
Internet Archive: Universal Access to Human Knowledge
• a 501(c)(3) non-profit
• Located in the Presidio, San Francisco, California
• Founded in 1996 to build an ‘Internet library’
• Provide permanent access for researchers, historians, and scholars to historical collections that exist in digital format.
• Built on open source principles
• Open Source software developed by Internet Archive and the IIPC
![Page 3: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/3.jpg)
3
Internet Archive Stats
• Largest public web archive
• 60 billion pages, 55 million sites
• Have expanded to include texts, audio, moving images, and software: 2.6 million downloads a day
• 60,000 unique users a day
![Page 4: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/4.jpg)
4
What do we collect?
Web Archive
• Take a broad snapshot of the web every 2 months
• 2 billion pages a month
• Websites from every domain (.org, .com, .edu, etc.)
• Content in 21 languages
• Entire archive accessible for free to the public via the website at www.archive.org
![Page 5: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/5.jpg)
5
Why try to collect and preserve it all?
• Web has no boundaries, no limits
• What will be important to future generations?
• What is there today may be gone tomorrow
– “Capture now, ask why later”
– “Grab it while you can, work it out later”
– “Lose as little as possible”
![Page 6: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/6.jpg)
6
How do we collect it?
Open Source Technology primarily developed by Internet Archive and IIPC
• Heritrix: web crawler
• Wayback Machine: access tool for rendering and viewing files
• Nutch and NutchWAX: search engine
• ARC file: archival record format (ISO work item)
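To make the ARC format a little more concrete, here is a minimal sketch of reading the one-line header that precedes each record in an ARC file. It assumes the version 1 layout (URL, IP address, 14-digit capture date, content type, and record length, separated by single spaces); the class and function names and the sample line are hypothetical and are not taken from the Internet Archive's own tools.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    """Fields of an ARC v1 URL-record header line (illustrative names)."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size in bytes of the record body that follows the header


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one space-separated ARC v1 header line into its five fields."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date_text, content_type, length_text = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # Capture dates are written as YYYYMMDDhhmmss.
        archive_date=datetime.strptime(date_text, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length_text),
    )


# Made-up sample header line, for illustration only.
header = parse_arc_header(
    "http://www.example.org/ 192.0.2.1 20060401120000 text/html 2048"
)
print(header.url, header.archive_date.isoformat(), header.length)
```

A reader built this way would use the length field to know how many bytes of captured content to read before the next header begins.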
![Page 7: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/7.jpg)
7
Wayback Machine
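The public Wayback Machine addresses a capture by combining a timestamp with the original URL. The sketch below builds such an address; the helper name is hypothetical, and requiring a full 14-digit timestamp is a simplifying assumption made for this example.

```python
def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine address for a capture of original_url.

    timestamp is expected as YYYYMMDDhhmmss (14 digits) in this sketch.
    """
    if not (timestamp.isdigit() and len(timestamp) == 14):
        raise ValueError("timestamp must be 14 digits: YYYYMMDDhhmmss")
    return f"http://web.archive.org/web/{timestamp}/{original_url}"


example = wayback_url("http://www.loc.gov/", "20060401120000")
print(example)
```

The service redirects to the capture closest to the requested time, so the timestamp does not need to match an existing capture exactly.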
![Page 8: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/8.jpg)
8
Preservation
• Store multiple copies of each Archive
• 1300 machines/servers
• Multiple copies at different geographical locations (U.S., Alexandria, Amsterdam)
• Standard storage boxes, open source design
![Page 9: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/9.jpg)
9
Archiving Next Steps
Institutions:
• Need to create collections around web material
• Want to dig deeper in crawls for their specific websites
• Want more control and access
• Want a technology partner that could harvest, index, access, store and preserve their collections for them
![Page 10: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/10.jpg)
10
1. Partner Contract Crawls
• In 2002, began to form partnerships with the Library of Congress, NARA and other national libraries, including those of Australia, France and Italy
– Dedicated crawl engineer
– Customized crawling
• Library of Congress collections (sample):
– Iraq War: 450 million documents and growing
– 2004 U.S. National Elections: 88 million documents
– Supreme Court Nomination 2005: 100 million documents
![Page 11: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/11.jpg)
11
2. Archive-It
• Last year, in early 2005, we had requests from state archivists, university librarians and other memory institutions:
– Develop an application for smaller institutions that have some resource constraints
– A web-based service that allows partners to create, manage, search and store their web archives
– User-friendly web interface
– Does not require technical expertise or infrastructure
• Pilot launched in September 2005
![Page 12: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/12.jpg)
12
Pilot Partners
• Center for Research Libraries
• Research Libraries Group (U of Toronto, U of Indiana, Haverford and Swarthmore Colleges, IISH)
• University of Texas
• Library of Virginia
• State Archives South Dakota
• State Archives North Carolina
• State Archives Alabama
• Minnesota Historical Society
• Institut d'Etude Politique de Grenoble
![Page 13: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/13.jpg)
13
Archive-It Access
• All collections are accessible for free to the general public, with text search, at:
– www.archive-it.org
– Partners' websites with links
• Plus, member web application with login
![Page 14: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/14.jpg)
14
Screen shot here
• Public site
![Page 15: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/15.jpg)
15
Test Drive the Application
![Page 16: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/16.jpg)
![Page 17: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/17.jpg)
![Page 18: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/18.jpg)
18
![Page 19: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/19.jpg)
19
Screen shots here
• Monitor page
• Reports page
• XML feed
![Page 20: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/20.jpg)
20
![Page 21: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/21.jpg)
• Search
– Your archived web pages are searchable by text or URL
![Page 22: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/22.jpg)
22
• Stored Online
• We provide copies of the files on a hard drive that we can ship to your institution up to twice a year
![Page 23: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/23.jpg)
23
Archive-It Releases
• 1.0 (February 8)
• 1.5 (April 19)
• 2.0 (July 29)
![Page 24: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/24.jpg)
24
Challenges we face
• Making the collections useful for a variety of end users (e.g. general public, researchers)
• Making sure we capture the best and most relevant content
• Continuing to develop our tools for access and harvesting (crawler.archive.org)
![Page 25: 1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006](https://reader036.vdocument.in/reader036/viewer/2022070323/56649d9e5503460f94a893aa/html5/thumbnails/25.jpg)
25
Internet Archive’s priorities
• Collaboration and Partnerships
– Continue to act as a technology partner in providing web archiving services to government and memory institutions
– Continue to develop Open Source software
– Develop common tools, storage formats and standards through the IIPC (International Internet Preservation Consortium)
– Open Content Alliance (OCA) digital books project
• Multiple copies across the world
– Within IA’s own facilities and with partners such as LC, BnF, Library of Alexandria