Documenting Internet2 an IT perspective Eric Celeste University of Minnesota (Twin Cities) Libraries for the Coalition for Networked Information 6 December 2005 ...or... A joyful romp with Heritrix, JavaScript, & Spotlight!



Documenting Internet2
an IT perspective

Eric Celeste
University of Minnesota (Twin Cities) Libraries
for the Coalition for Networked Information

6 December 2005

...or... A joyful romp with Heritrix, JavaScript, & Spotlight!


background...

• DI2 brought together
– University of Minnesota (CBI)
– University of Michigan (SI)
– Internet2

• web crawling only a small part

• the “save everything” approach


briefly…

• on crawling with spiders
• on Heritrix and JavaScript
• on Spotlight and local files
• on sinkholes and strategies


spiders on the web


pages


links


hosts & domains


robots.txt


scope


seeds


excluded pages


done!
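The crawl vocabulary on the preceding slides (pages, links, hosts, robots.txt, scope, seeds, excluded pages) can be tied together in a small sketch. This is not Heritrix code. It is a toy breadth-first crawl over a hypothetical in-memory set of pages, and the `PAGES` table, the `crawl` function, and the substring-based exclusion rule are all assumptions made for illustration. For simplicity one robots.txt is applied to every host and seeds are captured without a robots check.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical stand-in for the web: URL -> links found on that page.
PAGES = {
    "http://example.org/": ["/about", "/private/notes", "http://other.example.com/"],
    "http://example.org/about": ["/", "/calendar?month=1"],
    "http://example.org/private/notes": [],
    "http://example.org/calendar?month=1": [],
    "http://other.example.com/": [],
}

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def crawl(seeds, allowed_hosts, excluded_substrings, robots_txt):
    """Breadth-first crawl from the seeds, returning captured URLs in order."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    queue = deque(seeds)
    seen = set(seeds)
    captured = []
    while queue:
        url = queue.popleft()
        captured.append(url)  # "capture" the page
        for href in PAGES.get(url, []):
            link = urljoin(url, href)  # resolve relative links
            if link in seen:
                continue
            seen.add(link)
            if urlparse(link).hostname not in allowed_hosts:
                continue  # out of scope: wrong host
            if not robots.can_fetch("*", link):
                continue  # disallowed by robots.txt
            if any(s in link for s in excluded_substrings):
                continue  # explicitly excluded page
            queue.append(link)
    return captured

result = crawl(["http://example.org/"], {"example.org"}, ["calendar"], ROBOTS_TXT)
print(result)
```

Here the off-host link, the robots-protected path, and the excluded calendar URL are each dropped by a different rule, so only the seed and the `/about` page are captured.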


our crawler

• Heritrix, from the IA
• aiming for broad deployment, Archive-It
• cross-platform, many users
• simple setup, sophisticated options

• generates ARC files


from ARC to archive

• keep originals intact
• a few large files to manage
• can serve a mirror from the master
• can extract files for research
• solution requires Perl, PHP, JavaScript, MySQL


processing...

• for mirroring online
– optimizing and indexing with Perl
– loading into MySQL database
– presenting via PHP
• for using on local disk
– extracting files from ARC
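As a rough illustration of the "extracting files from ARC" step, the sketch below walks an uncompressed ARC version 1 stream, where each record starts with a one-line header (URL, IP address, archive date, content type, length) followed by that many bytes of payload. It is written in Python rather than the project's Perl, and it is not the project's script. The function names and the synthetic `SAMPLE` data are assumptions for the example. Real Heritrix output is usually gzip-compressed, which this sketch does not handle.

```python
import io

def iter_arc_records(stream):
    """Yield (url, content_type, payload) from an uncompressed ARC v1 stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # blank line separating two records
        fields = header.decode("utf-8", "replace").split()
        if len(fields) < 5:
            raise ValueError(f"unexpected ARC header: {header!r}")
        url, content_type, length = fields[0], fields[3], int(fields[-1])
        payload = stream.read(length)
        if url.startswith("filedesc://"):
            continue  # file-level description record, not a captured page
        yield url, content_type, payload

def split_http(payload):
    """Separate the stored HTTP response headers from the body."""
    head, _, body = payload.partition(b"\r\n\r\n")
    return head, body

def make_record(url, content_type, payload):
    # Helper for the synthetic sample only; IP and date are placeholders.
    header = f"{url} 192.0.2.1 20051206000000 {content_type} {len(payload)}\n"
    return header.encode("ascii") + payload + b"\n"

SAMPLE = b"".join([
    make_record("filedesc://sample.arc", "text/plain",
                b"1 0 Example\nURL IP-address Archive-date Content-type Archive-length\n"),
    make_record("http://example.org/", "text/html",
                b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>home</html>"),
    make_record("http://example.org/logo.gif", "image/gif",
                b"HTTP/1.1 200 OK\r\nContent-Type: image/gif\r\n\r\nGIF89a"),
])

records = list(iter_arc_records(io.BytesIO(SAMPLE)))
for url, content_type, payload in records:
    _, body = split_http(payload)
    print(url, content_type, len(body))
```

Because the reader only ever copies bytes out of the ARC file, the original stays intact, and the same loop could feed either an index for the online mirror or files written to local disk.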


joys of javascript...

• modifies the page after loading

• HTML almost unmolested
• changes explicit in code
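One way to read these bullets: the mirror serves each archived page with a single script reference added, and that script makes any visible changes in the browser after the page loads. The slides do not show how the project did this, so the sketch below is only an assumed mechanism. The `inject_archive_script` function and the `/archive/banner.js` path are hypothetical names.

```python
import re

def inject_archive_script(html, script_url="/archive/banner.js"):
    """Return html with one <script> tag added; everything else is untouched."""
    tag = f'<script src="{script_url}"></script>'
    # Prefer the end of <head>, fall back to the end of <body>.
    for pattern in (r"</head\s*>", r"</body\s*>"):
        match = re.search(pattern, html, re.IGNORECASE)
        if match:
            return html[:match.start()] + tag + html[match.start():]
    return html + tag  # no head or body found: append at the end

original = "<html><head><title>t</title></head><body>x</body></html>"
served = inject_archive_script(original)
print(served)
```

Removing the one inserted tag gives back the archived HTML exactly, which is the sense in which the change stays explicit and the HTML stays almost untouched.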


are we there yet?

• make the archive obvious
• yet intrude as little as possible


global research locally

• a web site in your pocket
• applying local tools
• maintaining browse-ability
• Apple’s Spotlight one of many
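For extracted files to stay browsable and searchable with local tools, each captured URL needs a predictable place on disk. The sketch below shows one possible mapping, with a directory per host and `index.html` for directory-style URLs. The layout, the `local_path_for` name, and the `mirror` root are assumptions for illustration, not the project's actual scheme, and query strings are simply ignored here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def local_path_for(url, root="mirror"):
    """Map a captured URL to a relative file path under root."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory-style URL gets a default file name
    # Drop empty, "." and ".." segments so the path cannot escape root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return str(PurePosixPath(root, parts.hostname or "unknown-host", *segments))

print(local_path_for("http://example.org/about/"))
print(local_path_for("http://example.org/img/logo.gif"))
```

With files laid out this way, a desktop indexer can search the text of the pages, and relative links between pages on the same host still resolve when the files are opened from disk.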


sinkholes / strategies

• partnership with institution
– config, IP, retention
• crawling far from perfect
– no creation dates, exclusions
– sticky traps, scripted pages (AJAX)
• scripts still immature
– better demarcation
– more self-contained (not at /)


still...

• capture & save what we can
• keep it as “original” as possible
• stay flexible for the future
• have fun in the present!


more information

• http://wiki.lib.umn.edu/DI2/

• Eric Celeste <[email protected]>