Preserving the Web: One Institution’s Foray into Digital Preservation through Web Archiving
Jeremy Floyd
Texas A&M University – Commerce
[email protected] @jjamesfloyd
Why save the web?
Google Data Center. The Dalles, Oregon 2012 <http://www.google.com/about/datacenters/gallery/>
Approaches and Considerations
• Do It Yourself Approach
  • IT Infrastructure
  • Level of ‘In-house’ Expertise
  • Long Term Digital Preservation
• Hosted Solutions
  • Annual Expenditure
  • Options for Joining a Consortium or Collaborative
Alington, Greg. 1936. “A Book Mark Would be Better.” Made for the Illinois WPA Art Project. From the Library of Congress Prints and Photographs Online Catalog <http://www.loc.gov/pictures/item/2011645389/>
HTTrack
• Free, open source software
• Allows downloading of websites to a local drive
• Preserves content and structure of target sites
OCLC Web Harvester
• Runs OCLC’s own web crawler
• Can import directly into CONTENTdm and the Connexion catalog
• Discoverable in WorldCat
• Can be saved in the OCLC Digital Archive
California Digital Library Web Archiving Service
• Free to join for all UC departments and organizations (charged only for storage)
• Fee-based subscription service for all other institutions
• Uses the Heritrix web crawler for capture, Wayback for display, and the NutchWAX search engine
• 56 public archives
• 21 partners
• 4,407 web sites
• 616,585,489 documents
• 32.3 TB of data
The Internet Archive: Archive-It
• Subscription service
• Heritrix web crawler
• NutchWAX search engine
• Wayback Machine browser
All developed and maintained by the Internet Archive
• More than 225 partner organizations
• 5,214,935,471 URLs in 2,056 collections
• Partners in 45 states and 15 countries, including university libraries, state archives, historical societies, federal institutions, NGOs, public libraries, and museums
Texas A&M University – Commerce partnered with Archive-It
Gathering Support Among Constituencies and Stakeholders
All aboard! Liberty Bond fourth issue Sept. 28 - Oct. 19, 1918. From the Library of Congress Prints and Photographs Online Catalog <http://www.loc.gov/pictures/item/00652400/>
Selecting Seed URLs
University Websites
• http://www.tamuc.edu/
• http://web.tamuc.edu/
• http://catalog.tamuc.edu/
• http://pride.tamuc.edu/
• http://www.tamu-commercedining.com/
• http://tamuc.orgsync.com/
• http://www.lionathletics.com/
Facebook
• http://www.facebook.com/tamucommerce/
• http://www.facebook.com/TAMUCLibraries/
• http://www.facebook.com/pages/AM-Commerce-Lion-Athletics/242136009137926?ref=ts/
• http://www.facebook.com/TAMUCspirit/
• http://www.facebook.com/tamucalumni/
Twitter
• http://twitter.com/TAMU_Commerce/
• http://twitter.com/Lion_Athletics/
• http://twitter.com/ketrradio/
• http://twitter.com/TheEastTexan/
• http://twitter.com/LionsAfterDark/
• http://twitter.com/TAMUC_News/
• http://twitter.com/LionSafety/
• http://twitter.com/TAMUCalumni/
• http://twitter.com/TAMUC_Mesquite/
YouTube
• http://www.youtube.com/user/LionsMedia/
University News and Media
• http://www.ketr.org/
• http://TheEastTexanOnline.com
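A seed list like the one above is easier to review when it is grouped by host, since each host usually needs its own scoping decision. The following is a minimal sketch of that bookkeeping in Python, not part of the presentation or of any Archive-It tooling; the function name `group_seeds_by_host` is hypothetical, and only a handful of the seeds are shown.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_seeds_by_host(seeds):
    """Group seed URLs by lower-cased host name, keeping input order."""
    groups = defaultdict(list)
    for url in seeds:
        # urlsplit(...).hostname is already lower-cased by the standard library.
        host = urlsplit(url).hostname or ""
        groups[host].append(url)
    return dict(groups)


# A small sample of the seeds listed on the slide.
seeds = [
    "http://www.tamuc.edu/",
    "http://catalog.tamuc.edu/",
    "http://twitter.com/TAMU_Commerce/",
    "http://twitter.com/Lion_Athletics/",
    "http://TheEastTexanOnline.com",
]

grouped = group_seeds_by_host(seeds)
for host, urls in grouped.items():
    print(host, len(urls))
```

Grouping this way makes it easy to see which hosts carry many seeds (here, the social media accounts) and may deserve a different crawl frequency from the main university site.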
Managing Scope and Frequency of Crawls
robots.txt
“Robots- Electro and Sparko” 1940. Still image. Computer History Museum <http://www.computerhistory.org/collections/accession/102693536>
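As an illustration of how a robots.txt file affects what a crawler may capture, the sketch below uses Python's standard `urllib.robotparser` to test two URLs against a set of rules. The rules, the host `www.example.org`, and the user agent name `example-crawler` are all made up for the example; they do not come from the presentation or from any of the seed sites.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A page outside the disallowed path may be fetched.
allowed = parser.can_fetch("example-crawler", "http://www.example.org/news/")

# A page under /private/ is excluded for every user agent.
blocked = parser.can_fetch(
    "example-crawler", "http://www.example.org/private/report.html"
)

print(allowed, blocked)
```

Running a check like this against a site's real robots.txt before a crawl shows which parts of a seed would be skipped by a crawler that honors the file.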
Crawler Traps
“It’s A Trap” 2010. Know Your Meme <http://knowyourmeme.com/memes/its-a-trap>
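Crawler traps often show up as URLs that keep growing, for example calendar pages or relative links that repeat the same path segments indefinitely. The sketch below is one simple heuristic for flagging such URLs when reviewing a crawl report; the function name `looks_like_trap` and the thresholds are assumptions chosen for illustration, not settings taken from Heritrix or Archive-It.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=12, max_repeats=3):
    """Flag URLs whose path is very deep or repeats one segment many times.

    max_depth and max_repeats are illustrative thresholds, not crawler defaults.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    return any(n >= max_repeats for n in counts.values())


suspect = looks_like_trap(
    "http://www.example.org/calendar/2012/calendar/2012/calendar/2012/"
)
normal = looks_like_trap("http://www.example.org/news/2012/03/story.html")
print(suspect, normal)
```

A heuristic like this only surfaces candidates. Each flagged pattern still needs a human decision about whether to add a scoping rule for that host.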
Adding Descriptive Metadata
Rebecca Goldman. 2009. “Core Values.” Derangement and Description. <http://derangementanddescription.wordpress.com/2009/07/13/core-values/>
Establishing a Workflow
Access and Future Growth
Further Resources
• Niu, Jinfang. 2012. “An Overview of Web Archiving.” D-Lib Magazine 18(3/4). http://www.dlib.org/dlib/march12/niu/03niu1.html
• LOC Signal Blog: http://blogs.loc.gov/digitalpreservation/
• International Internet Preservation Consortium (IIPC): http://netpreserve.org/
• International Web Archiving Workshop (2001 – 2010): http://www.iwaw.net/
• Society of American Archivists, Web Archiving Roundtable: http://www2.archivists.org/groups/web-archiving-roundtable/
email: [email protected]
@jjamesfloyd
http://www.slideshare.net/jjamesfloyd/preserving-the-web/