ERIKA: Eesti Ressursid Internetis, Kataloogimine ja Arhiveerimine
(Estonian Resources on the Internet: Indexing and Archiving)
Web Archiving Experience
• Based on criteria (Collection Development Department)
• Very narrow approach: the traditional bibliographic view of a “publication”
• Document-centered: only certain files are archived, not the site as a whole
• The shortcomings of this approach are discussed below
• Tools: HTTrack, NEDLIB harvester, WebZip
Future ways
• Whole “.ee” domain (crawler)
• Based on criteria (Collection Development Department), centered on the concept of a “website”, which could be a subdomain, a server or a directory
Current situation
Website information → database with title, URL, comment, dates → HTTrack website copier → archive on a hard drive
Current situation
• Registration database
• HTTrack website copier
• HTTrack logs database
• Website files not indexed, not packed (50 GB)
• Simple web interface
• Record in OPAC links to the original URL
Using freeware for all
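The registration database above could look like the following sketch; the deck only mentions title, URL, comment and dates, so the exact column names and the use of SQLite are assumptions for illustration:

```python
import sqlite3

# In-memory sketch of the registration database; the production schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE registration (
        id      INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        url     TEXT NOT NULL UNIQUE,
        comment TEXT,
        added   TEXT,          -- date the record was created
        crawled TEXT           -- date of the last HTTrack run
    )
""")
conn.execute(
    "INSERT INTO registration (title, url, comment, added) VALUES (?, ?, ?, ?)",
    ("Example site", "http://www.example.ee/", "test record", "2005-01-01"),
)
row = conn.execute("SELECT title, url FROM registration").fetchone()
print(row)  # ('Example site', 'http://www.example.ee/')
```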
Current situation
Problems
• Manual labour
• Missing web publications
• Missing the context of publications (a user may prefer to browse the archived website rather than search by the title and author of a certain file)
• Excluding resources that are not “publications” but are important for the national memory (e.g. websites of political parties)
Advantages
• Take only what we need
• Extensive bibliographic descriptions
• Everything is under control
Software
• HTTrack
- takes a list of URLs
- saves the website as a mirror; all internal links are converted to point to the local mirror
• Tested earlier:
- NEDLIB harvester: downloads files and meta-information, but does not preserve the internal structure
- WebZip: similar to HTTrack, saves a mirror; problem: licensed software
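The HTTrack step could be driven from a script. A minimal sketch follows; `-%L` (read start URLs from a list file) and `-O` (output/mirror directory) are real HTTrack options, but this particular wrapper and its names are assumptions:

```python
import subprocess
from pathlib import Path

def build_httrack_cmd(url_list: Path, archive_dir: Path) -> list[str]:
    """Build an HTTrack command line from a URL-list file and a target directory."""
    return [
        "httrack",
        "-%L", str(url_list),    # take start URLs from a list file
        "-O", str(archive_dir),  # save the mirror under this directory
    ]

def mirror_sites(url_list: Path, archive_dir: Path) -> int:
    """Run HTTrack and return its exit code (assumes `httrack` is on PATH)."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_httrack_cmd(url_list, archive_dir)).returncode
```

Keeping command construction separate from execution makes the wrapper easy to test without actually crawling anything.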
Access
• HTTrack logs are read into a database
• Access script
- redirects URLs
- sets the content type (problem: x.php?file=id is saved as HTML while in reality it is a PDF or something else)
• OPAC records will be updated to point to the archive
• full-text index?
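The content-type problem above (a `x.php?file=id` response stored under an `.html` name while actually being a PDF) can be worked around by sniffing the file's leading bytes instead of trusting the extension. A minimal sketch, with an assumed signature table covering only a few common formats:

```python
def sniff_content_type(data: bytes) -> str:
    """Guess a MIME type from leading magic bytes, ignoring the file extension."""
    signatures = [
        (b"%PDF-", "application/pdf"),
        (b"\x1f\x8b", "application/gzip"),
        (b"PK\x03\x04", "application/zip"),
        (b"GIF8", "image/gif"),
        (b"\x89PNG", "image/png"),
    ]
    for magic, mime in signatures:
        if data.startswith(magic):
            return mime
    # Fall back to HTML for anything that looks like markup, else a generic type.
    if data.lstrip()[:1] == b"<":
        return "text/html"
    return "application/octet-stream"

print(sniff_content_type(b"%PDF-1.4 ..."))  # application/pdf
```

An access script could call this on the first kilobyte of the archived file and emit the sniffed type in the Content-Type response header.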
Perspectives
Website information → DB registration → harvester → files stored locally → check for correct saving → indexing → access to files
Perspectives
• Test IIPC software (netpreserve.org)
• Use a crawler to collect (Heritrix)
• Cooperation with neti.ee – Nuhk (Spy) crawler?
Scope
• As much as possible
• Strict criteria
Perspectives
• Heritrix (http://crawler.archive.org/), NutchWAX, WERA
- crawl + index + pack + web interface
• Neti.ee (www.neti.ee)
- own crawler, database of the current situation, only “Estonian” IP addresses; the buffer includes only text, no images or other formats
Perspectives
• Criteria-based system, more automated
Steps:
1) Log in to a web interface
2) Add the link(s) of a webpage, some comments and an update period
3) The system gets the page and saves it to the archive (packed or unpacked)
4) Check the webpage saved in the archive
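The steps above could be sketched as a small registration-and-fetch routine. All names here, and the use of `urllib` for the download, are assumptions for illustration; a real system would likely call a harvester such as HTTrack or Heritrix instead:

```python
import urllib.request
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Registration:
    """Step 2: a registered link with comments and an update period."""
    url: str
    comment: str = ""
    update_period_days: int = 30  # how often to re-harvest

def save_to_archive(reg: Registration, archive_root: Path) -> Path:
    """Step 3: fetch the registered page and store it under the archive root."""
    safe_name = reg.url.replace("://", "_").replace("/", "_")
    target = archive_root / safe_name
    with urllib.request.urlopen(reg.url) as resp:
        target.write_bytes(resp.read())
    return target

def check_saved(path: Path) -> bool:
    """Step 4: a trivial control check - the saved file exists and is non-empty."""
    return path.exists() and path.stat().st_size > 0
```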