ERIKA – Eesti Ressursid Internetis: Kataloogimine ja Arhiveerimine
(Estonian Resources on the Internet: Indexing and Archiving)
Web Archiving
Experience
• Based on criteria (Collection Development Department)
• Very narrow approach – traditional bibliographic view of a "publication"
• Document-centred – only certain files, not the site as a whole
• Shortcomings of this approach: see the Problems slide below
(HTTrack, NEDLIB, WebZIP)
Future ways
• Whole “.ee” domain (crawler)
• Based on criteria, centred on the concept of a "website" – could be a subdomain, server, or directory (Collection Development Department)
Current situation
[Diagram: website information → database with title, URL, comment, dates → HTTrack website copier → archive on a hard drive]
Current situation
• Registration database (see the sketch below)
• HTTrack website copier
• HTTrack logs database
• Website files not indexed, not packed (50 GB)
• Simple web interface
• Record in OPAC links to the original URL
Using freeware for everything
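For illustration, a minimal sketch of such a registration database using SQLite. The table and column names are assumptions, not the actual ERIKA schema.

    import sqlite3

    # Hypothetical schema mirroring the fields above: title, URL, comment, dates.
    con = sqlite3.connect("erika.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS registration (
            id          INTEGER PRIMARY KEY,
            title       TEXT,
            url         TEXT NOT NULL,
            comment     TEXT,
            registered  TEXT,    -- date the resource was registered
            harvested   TEXT     -- date of the last successful copy
        )
    """)
    con.execute(
        "INSERT INTO registration (title, url, comment, registered) "
        "VALUES (?, ?, ?, date('now'))",
        ("Example site", "http://www.example.ee/", "illustrative entry"),
    )
    con.commit()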
Current situation
Problems
• Manual labour
• Missing web publications
• Missing the context of publications (a user may prefer to browse the archived website rather than search by the title and author of a certain file)
• Excluding resources that are not "publications" but are important for the national memory (e.g. websites of political parties)

Advantages
• Take only what we need
• Extensive bibliographic descriptions
• Everything is under control
Software
• HTTrack – takes a list of URLs, saves the website as a mirror; all internal links are converted to point to the local mirror (a usage sketch follows this slide)
• Tested earlier:
– NEDLIB harvester: downloads files and meta-information but does not save the internal structure
– WebZIP: similar to HTTrack, saves a mirror; problem – licensed software
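As a rough illustration of the HTTrack step, a short Python wrapper that mirrors one registered URL. The target URL, output directory, and filter are assumptions, and the httrack binary must be installed.

    import subprocess

    # Mirror one site with HTTrack; internal links are rewritten to the local copy.
    # The URL, output directory, and filter below are illustrative, not ERIKA's values.
    subprocess.run([
        "httrack", "http://www.example.ee/",
        "-O", "/archive/example.ee",     # where the mirror is written
        "+*.example.ee/*",               # stay within this site
        "--mirror",                      # standard mirroring mode (-w)
    ], check=True)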
Access
• HTTrack logs are read into a database
• Access script – redirects URLs; handles the content-type problem (x.php?file=id is saved as .html while in reality it is a PDF or something else – see the sketch after this list)
• OPAC records will be updated to point to the archive
• full-text index?
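A minimal sketch of the content-type workaround mentioned above: before serving an archived file, sniff its real type from the leading bytes instead of trusting the .html extension that HTTrack assigned to dynamic URLs. The magic-byte table and the default type are assumptions.

    # Sniff the real content type of an archived file from its magic bytes,
    # since e.g. x.php?file=id may be saved as .html while actually being a PDF.
    MAGIC = {
        b"%PDF":       "application/pdf",
        b"PK\x03\x04": "application/zip",
        b"\x89PNG":    "image/png",
        b"GIF8":       "image/gif",
    }

    def sniff_content_type(path, default="text/html"):
        with open(path, "rb") as f:
            head = f.read(8)
        for magic, ctype in MAGIC.items():
            if head.startswith(magic):
                return ctype
        return default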
Perspectives
[Diagram: website information → registration DB → harvester → files stored locally → indexing; access to files / check for correct saving]
Perspectives
• Test IIPC software (netpreserve.org)
• Use a crawler to collect (Heritrix)
• Cooperation with neti.ee – Nuhk ("Spy") crawler?
Scope
• As much as possible
• Strict criteria
Perspectives
• Heritrix (http://crawler.archive.org/), NutchWAX, WERA
– crawl + index + pack + web interface (a packing sketch follows this list)
• Neti.ee (www.neti.ee)
– own crawler, database of the current situation, only "Estonian" IP addresses; the buffer includes only text – no images or other formats
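The "pack" step could look like the following sketch, which writes one harvested response into a WARC container using the warcio library (a later tool from the same IIPC ecosystem; the URL and payload are illustrative).

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    # Pack one harvested page into a compressed WARC file.
    with open("erika.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        http_headers = StatusAndHeaders(
            "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.0")
        record = writer.create_warc_record(
            "http://www.example.ee/", "response",
            payload=BytesIO(b"<html>...</html>"),
            http_headers=http_headers)
        writer.write_record(record)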
Perspectives
• Criteria-based system, more automated
Steps (a sketch follows this list):
1) Log in to a web interface
2) Add the link(s) of a webpage, some comments, and an update period
3) The system fetches the page and saves it to the archive (packed or unpacked)
4) Check that the webpage was saved correctly in the archive
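A minimal sketch of steps 2–3, assuming a plain HTTP fetch and a zip file as the "packed" archive. The URL, member-naming scheme, and archive path are all illustrative.

    import datetime
    import urllib.request
    import zipfile

    def harvest(url, archive="erika-archive.zip"):
        """Fetch one registered page and append it to a packed (zip) archive."""
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        stamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
        member = url.replace("://", "_").replace("/", "_") + "." + stamp
        with zipfile.ZipFile(archive, "a") as z:
            z.writestr(member, body)
        return member

    harvest("http://www.example.ee/")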
Perspectives
ERIKA vs DIGAR
• DIGAR includes only objects with a controlled structure, in PDF format
• ERIKA can contain any format you get from the internet