Web Archiving

ERIKA: Eesti Ressursid Internetis, Kataloogimine ja Arhiveerimine (Estonian Resources in Internet, Indexing and Archiving)


Page 1: Web Archiving

ERIKA: Eesti Ressursid Internetis, Kataloogimine ja Arhiveerimine

(Estonian Resources in Internet, Indexing and Archiving)

Page 2: Web Archiving

Web Archiving

Experience

• Based on criteria (Collection Development Department)

• very narrow approach: a traditional bibliographic view of a “publication”

• document-centered: only certain files, not the site as a whole

• shortcomings of this approach (slide 5)

(httrack, nedlib, webzip)

Future ways

• Whole “.ee” domain (crawler)

• Based on criteria, centered on the concept of a “website” (could be a subdomain, a server, or a catalog) (Collection Development Department)

Page 3: Web Archiving

Current situation

[Diagram: website information is registered in a database (title, URL, comment, dates); Httrack Website Copier saves the site into an archive on a hard drive.]

Page 4: Web Archiving

Current situation

• Registration database
• Httrack website copier
• Httrack logs database
• Website files not indexed, not packed (50 GB)

• Simple web interface
• Record in OPAC links to the original URL

Using freeware for all

Page 5: Web Archiving

Current situation

Problems

• Manual labour

• Missing web publications

• Missing the context of publications (the user may prefer to browse the archived website rather than search by the title and author of a certain file)

• Excluding resources that are not “publications” but are important for the national memory (e.g. websites of political parties)

Advantages

• Take only what we need

• Extensive bibliographic descriptions

• Everything is under control

Page 6: Web Archiving

Software

• httrack
– takes a list of URLs
– saves the website as a mirror; all internal links are converted to point to the local mirror

• tests in history
– nedlib harvester: downloads files and meta-info, does not save the internal structure
– webzip: similar to httrack, saves a mirror; problem: licenced software
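As an illustration of the mirroring behaviour described above, here is a minimal Python sketch of the internal-link rewriting that a tool such as httrack performs. The function name and the regex-based approach are a simplification for this sketch, not httrack's actual implementation.

```python
import re
from urllib.parse import urlparse

def rewrite_internal_links(html: str, site_host: str) -> str:
    """Rewrite absolute links to site_host into local mirror paths;
    external links are left untouched (hypothetical helper)."""
    def repl(match):
        attr, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if parsed.netloc == site_host:
            # internal link: point at the locally saved copy instead
            local = parsed.path.lstrip("/") or "index.html"
            return f'{attr}"{local}"'
        return match.group(0)
    return re.sub(r'(href=|src=)"([^"]+)"', repl, html)

page = '<a href="http://example.ee/docs/a.html">A</a> <a href="http://other.ee/">B</a>'
print(rewrite_internal_links(page, "example.ee"))
# the first link becomes href="docs/a.html"; the second stays absolute
```

A real mirroring tool also rewrites relative links, query strings, and CSS/JS references; this sketch only shows the core idea of pointing internal links at the local copy.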

Page 7: Web Archiving

Software

Page 8: Web Archiving

Software

Page 9: Web Archiving

Access

• httrack logs are read into a database
• access script
– redirects URLs
– content-type: a problem with URLs such as x.php?file=id, which are saved as HTML while in reality the file is a PDF or something else

• OPAC records will be updated to point to the archive

• full-text index?
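The content-type problem above can be worked around by sniffing the first bytes of the saved file instead of trusting the URL. A minimal sketch, with an illustrative (not exhaustive) list of magic bytes:

```python
def sniff_content_type(data: bytes) -> str:
    """Guess the real content type from magic bytes, since a URL like
    x.php?file=id says nothing about the format behind it (sketch)."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"PK\x03\x04"):  # zip-based formats (docx, odt, ...)
        return "application/zip"
    head = data[:512].lower()
    if b"<html" in head or b"<!doctype html" in head:
        return "text/html"
    return "application/octet-stream"

print(sniff_content_type(b"%PDF-1.4 ..."))  # application/pdf
```

An access script could run this check before serving a file, so that a PDF saved under a .html name still gets the correct Content-Type header.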

Page 10: Web Archiving
Page 11: Web Archiving

Perspectives

[Diagram: website information goes into DB registration; a harvester stores the files locally; the files are indexed; access to the files includes a check for correct saving.]

Page 12: Web Archiving

Perspectives

• Test IIPC software (netpreserve.org)
• Use a crawler to collect (heritrix)

• Cooperation with neti.ee – Nuhk (Spy) crawler?

Scope

• as much as possible
• strict criteria

Page 13: Web Archiving

Perspectives

• Heritrix (http://crawler.archive.org/), Nutchwax, WERA

crawl + index + pack + web interface

• Neti.ee (www.neti.ee)

own crawler, database of the current situation, only “Estonian” IP addresses; the buffer includes only text, no images or other formats

Page 14: Web Archiving

Perspectives

• Criteria-based system, more automated:

Steps:
1) Log in to a web interface
2) Add the link (or links) of a webpage, some comments, and an update period
3) The system gets the page and saves it to the archive (packed or unpacked)
4) Check the webpage saved in the archive
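The registration-and-update cycle in the steps above could be modelled like this. The `Registration` class and its `due()` method are hypothetical names for this sketch, not part of the actual system:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Registration:
    """One archived-site entry in the registration database (step 2)."""
    url: str
    comment: str = ""
    update_period_days: int = 30   # how often the system re-fetches the page
    last_harvested: Optional[date] = None

    def due(self, today: date) -> bool:
        """Step 3: fetch if never harvested, or if the update period has passed."""
        if self.last_harvested is None:
            return True
        return today - self.last_harvested >= timedelta(days=self.update_period_days)

reg = Registration("http://example.ee/", comment="test site", update_period_days=7)
print(reg.due(date(2006, 1, 10)))   # never harvested, so True
reg.last_harvested = date(2006, 1, 10)
print(reg.due(date(2006, 1, 14)))   # only 4 days passed, so False
```

A scheduler would loop over all registrations, fetch the ones that are due, and then record the harvest date, leaving step 4 (checking the saved copy) to a curator.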

Page 15: Web Archiving

Perspectives

ERIKA vs DIGAR

DIGAR includes only objects with a controlled structure, in PDF format.

ERIKA can contain any format you get from the internet.