a new model for web resource harvesting
OAI-PMH for Resource Harvesting Tutorial, OAI4, October 20th 2005, CERN, Geneva, Switzerland
A New Model for Web Resource Harvesting
This work supported in part by the Andrew Mellon Foundation & Library of Congress
Michael Nelson
Computer Science Department
Old Dominion University
Herbert Van de Sompel
Digital Library Research & Prototyping Team
Research Library, Los Alamos National Laboratory
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
WWW and DL: Separated at Birth
1994
DL
WWW
Today
The Good: XML, BitTorrent, Web Services
The Bad: RSS
The Ugly: Semantic Web
The Good: OAIS, DOI, OAI-PMH
The Bad: Dublin Core
The Ugly: SRU/W
The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.
WWW
DL
Web Robots
www.getty.edu
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc100; last mod 2003-09-11
…
what documents have been modified since 2003-11-15?
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
what is this file?
what are its relationships to other files?
how often does it change?
A More Efficient Way
what documents have been modified since 2003-11-15?
www.getty.edu with mod_oai
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc100; last mod 2003-09-11
…
<co> <metadata/> <link/> <link/> <change/> …</co>
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
mod_oai approach
• Goal: integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
  o written in C
  o respects values in .htaccess, httpd.conf
• compile mod_oai on http://www.foo.edu/
• baseURL is now http://www.foo.edu/modoai
  o Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
    - http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
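As an illustration of the request shown on this slide, the following is a minimal sketch of how a harvester might assemble such a URL. The helper name `build_request` is hypothetical and not part of mod_oai; the baseURL and argument values are the ones from the slide. Note that `urlencode` percent-encodes the colons in the set name, which a server should decode back to `mime:video:mpeg`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_request(base_url, verb, arguments):
    """Assemble an OAI-PMH request URL (hypothetical helper, not mod_oai code).

    `arguments` is a dict because "from" is a reserved word in Python
    and cannot be passed as a keyword argument.
    """
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)

url = build_request(
    "http://www.foo.edu/modoai",          # baseURL from the slide
    "ListIdentifiers",
    {
        "metadataPrefix": "oai_dc",
        "from": "2004-09-15",             # only items changed on or after this date
        "set": "mime:video:mpeg",         # only items in this MIME-type set
    },
)
print(url)
```

Passing the arguments as a dict keeps the sketch usable for the other selective-harvesting arguments (`until`, a different `set`) without changing the helper.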
OAI-PMH data model in mod_oai
resource
item
Dublin Core metadata records
OAI-PMH identifier = entry point to all records pertaining to the resource
MPEG-21 DIDL
metadata pertaining to the resource
HTTP header metadata
http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
OAI-PMH sets
MIME type
OAI-PMH concepts: typical repository
OAI-PMH Entity      value            description
Resource            URL              PDF, PS, XML, HTML or other file
Item
  identifier        OAI Identifier   DNS-based name of metadata about resource
  set membership    LCSH             Library of Congress Subject Heading
Record
  metadataPrefix    oai_dc           bibliographic metadata in Dublin Core
  datestamp         2004-10-18       modification date of DC record
Record
  metadataPrefix    oai_marc         bibliographic metadata in MARC
  datestamp         2004-07-31       modification date of MARC record
OAI-PMH concepts: mod_oai
OAI-PMH Entity      value            description
Resource            URL              HTML, GIF, PDF or other web file
Item
  identifier        URL              same URL as the resource
  set membership    MIME type        MIME type of the resource
Record
  metadataPrefix    http_header      the http headers that would have been returned via HTTP GET/HEAD
  datestamp         2004-07-31       modification date of resource
Record
  metadataPrefix    oai_dc           a subset of http_header in DC
  datestamp         2004-07-31       modification date of resource
Record
  metadataPrefix    oai_didl         MPEG-21 DIDL: base64 encoded resource + http_header metadata
  datestamp         2004-07-31       modification date of resource
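The mapping in the mod_oai table can be sketched as a small function that derives the item-level values from a URL and its HTTP headers. This is only an illustration of the data model, not the module's C implementation: the function name `describe_item` and the sample header values are assumptions, and the `mime:type:subtype` set naming is inferred from the `set=mime:video:mpeg` example earlier in the slides.

```python
from email.utils import parsedate_to_datetime

def describe_item(url, headers):
    """Derive identifier, set and datestamp the way the table describes.

    Hypothetical helper: identifier = the URL itself, set = MIME type,
    datestamp = modification date of the resource.
    """
    # Drop any parameters such as "; charset=utf-8" from the Content-Type.
    mime = headers["Content-Type"].split(";")[0].strip()
    modified = parsedate_to_datetime(headers["Last-Modified"])
    return {
        "identifier": url,
        "setSpec": "mime:" + mime.replace("/", ":"),
        "datestamp": modified.date().isoformat(),
        "metadataPrefixes": ["http_header", "oai_dc", "oai_didl"],
    }

# Sample headers are invented for the example.
item = describe_item(
    "http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf",
    {
        "Content-Type": "application/pdf",
        "Last-Modified": "Sat, 31 Jul 2004 12:00:00 GMT",
    },
)
print(item)
```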
Resource Discovery: ListIdentifiers
harvester
• issues a ListIdentifiers
• finds URLs of updated resources
• does HTTP GETs (updates only)
• can get URLs of resources with specified MIME types
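The discovery step can be sketched as follows: parse a ListIdentifiers response and collect the identifiers, which in mod_oai are the resource URLs a harvester would then fetch with ordinary HTTP GETs. The sample response is hand-written for this example (the URLs and dates are invented); only the OAI-PMH 2.0 namespace and the `header` / `identifier` / `datestamp` elements come from the protocol itself.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response; not captured from a real mod_oai server.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/movies/a.mpg</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
    <header status="deleted">
      <identifier>http://www.foo.edu/movies/old.mpg</identifier>
      <datestamp>2004-09-21</datestamp>
    </header>
    <header>
      <identifier>http://www.foo.edu/movies/b.mpg</identifier>
      <datestamp>2004-10-02</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""

def updated_urls(response_xml):
    """Return (url, datestamp) pairs for every non-deleted header."""
    root = ET.fromstring(response_xml.strip())
    found = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # nothing to fetch for a deleted item
        url = header.findtext("oai:identifier", namespaces=OAI_NS)
        stamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        found.append((url, stamp))
    return found

to_fetch = updated_urls(SAMPLE_RESPONSE)
print(to_fetch)
```

A real harvester would also follow `resumptionToken` elements for long lists; that is left out to keep the sketch short.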
Preservation: ListRecords
harvester
• issues a ListRecords
• gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
• can get resources with specified MIME types
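To make the By Value / By Reference distinction concrete, here is a rough sketch that pulls resources out of a simplified DIDL document. The sample document is invented for this example; the namespace URI and the `mimeType`, `encoding` and `ref` attributes reflect a general understanding of MPEG-21 DIDL and are assumptions that should be checked against actual mod_oai output.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed DIDL namespace; verify against real mod_oai responses.
DIDL_NS = {"didl": "urn:mpeg:mpeg21:2002:02-DIDL-NS"}

# Simplified, hand-written stand-in for an oai_didl record.
SAMPLE_DIDL = """
<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="text/plain" encoding="base64">aGVsbG8gd29ybGQ=</didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="video/mpeg" ref="http://www.foo.edu/movies/a.mpg"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>
"""

def extract_resources(didl_xml):
    """Return (mode, mime_type, payload) for each Resource element.

    By Value: payload is the decoded bytes of the embedded resource.
    By Reference: payload is the URL the harvester still has to fetch.
    """
    root = ET.fromstring(didl_xml.strip())
    results = []
    for res in root.iterfind(".//didl:Resource", DIDL_NS):
        ref = res.get("ref")
        if ref is not None:
            results.append(("by-reference", res.get("mimeType"), ref))
        else:
            data = base64.b64decode((res.text or "").strip())
            results.append(("by-value", res.get("mimeType"), data))
    return results

resources = extract_resources(SAMPLE_DIDL)
print(resources)
```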
performance of mod_oai and wget on www.cs.odu.edu
                              wget                    wget                           mod_oai
                              (index.html as seed)    ("find . -type f" as seed)
files
# of files in baseline        709                     5739                           5268
# of files in update (25%)    114                     1318                           1335
Readings
• Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland. mod_oai: An Apache Module for Metadata Harvesting. http://arxiv.org/abs/cs.DL/0503069
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
Issues and Future Work
• For a given server, there is a set of URLs, U, and a set of files, F
  o Apache maps U → F
  o mod_oai maps F → U
• Neither function is 1-1 nor onto
  o We can easily check if a single u maps to F, but given F we cannot (easily) generate U
• Short-term issues:
  o dynamic files
    - exporting unprocessed server-side files would be a security hole
  o IndexIgnore
    - httpd will “hide” valid URLs
  o File permissions
    - httpd will advertise files it cannot read
• Long-term issues:
  o Alias, Location
    - files can be covered up by the httpd
  o UserDir
    - interactions between the httpd and the filesystem
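The "neither 1-1 nor onto" point can be illustrated with a toy URL-to-file table. The paths below are invented for the example: two URLs resolve to the same file (so the mapping is not 1-1), and one file on disk is reachable by no URL (so it is not onto).

```python
# Toy U -> F mapping; all paths are hypothetical.
url_to_file = {
    "/": "/htdocs/index.html",
    "/index.html": "/htdocs/index.html",   # second URL for the same file
    "/paper.pdf": "/htdocs/paper.pdf",
}
files_on_disk = {
    "/htdocs/index.html",
    "/htdocs/paper.pdf",
    "/htdocs/.htaccess",                   # on disk, but served under no URL
}

def is_one_to_one(mapping):
    """True if no two URLs share a file."""
    return len(set(mapping.values())) == len(mapping)

def is_onto(mapping, files):
    """True if every file is reached by at least one URL."""
    return set(files) <= set(mapping.values())

def urls_for(mapping, path):
    """Invert the mapping for one file: the F -> U direction."""
    return sorted(url for url, target in mapping.items() if target == path)

print(is_one_to_one(url_to_file), is_onto(url_to_file, files_on_disk))
print(urls_for(url_to_file, "/htdocs/index.html"))
print(urls_for(url_to_file, "/htdocs/.htaccess"))
```

Inverting the table is trivial here only because U is written down explicitly; the difficulty noted on the slide is that a server has no such enumerated list of its own URLs.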
IndexIgnore & File Permissions
Alias: Covering Up Files
httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs
http://server/A
http://server/B
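The effect of these two Alias lines can be sketched with a simplified resolver. This is not how Apache implements Alias; it is a first-match prefix lookup written only to show the swap, and the DocumentRoot value is an assumption inferred from the alias targets.

```python
# Assumed DocumentRoot, inferred from the alias targets on the slide.
DOCUMENT_ROOT = "/usr/local/web/htdocs"

# The two Alias directives from the slide, as (URL prefix, filesystem target).
ALIASES = [
    ("/A", "/usr/local/web/htdocs/B"),
    ("/B", "/usr/local/web/htdocs/A"),
]

def url_path_to_file(url_path):
    """Simplified URL-path resolution: first matching alias wins."""
    for prefix, target in ALIASES:
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOCUMENT_ROOT + url_path

for path in ("/A", "/B", "/A/report.html", "/C"):
    print(path, "->", url_path_to_file(path))
```

A module that walks the filesystem and guesses `htdocs/A` is served at `/A` would advertise the wrong URL for both directories, which is the "covering up" problem.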
UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf%
Looking Further Down the Road for mod_oai
• “Reverse” the method of URL discovery
  o cannot look to the files;
  o listen to incoming requests and build a list of valid URLs
    - could be seeded with files at start
    - also the method for handling server processed files / URLs
• Plug-ins for descriptive metadata
  o DC tags in HTML
  o MS Office formats, PDF
  o Tags from JPEG, TIFF, MP3, etc.
• Additional metadata in the DIDL
  o technical metadata from JHOVE
  o estimated change rate
    - cf. Cho & Garcia-Molina, ACM TOIT 28(4)
• http log access as separate metadata formats
  - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
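The "listen to incoming requests" idea could be prototyped offline by scanning an access log for requests that succeeded. The sketch below assumes log lines in the common log format; the sample lines, the regular expression and the function name are all made up for the example, and a real module would observe requests inside the server rather than read a log.

```python
import re

# Matches the request and status fields of a common-log-format line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')

# Invented sample log lines.
SAMPLE_LOG = """\
192.0.2.1 - - [20/Oct/2005:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 1043
192.0.2.2 - - [20/Oct/2005:10:00:05 +0200] "GET /missing.html HTTP/1.1" 404 209
192.0.2.1 - - [20/Oct/2005:10:01:00 +0200] "GET /cgi-bin/search?q=oai HTTP/1.1" 200 5120
192.0.2.3 - - [20/Oct/2005:10:02:00 +0200] "HEAD /index.html HTTP/1.0" 200 -
"""

def valid_urls(log_text):
    """Collect request paths that were answered with status 200."""
    seen = set()
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match and match.group("status") == "200":
            seen.add(match.group("path"))
    return sorted(seen)

known = valid_urls(SAMPLE_LOG)
print(known)
```

Because the list is built from what clients actually requested, it also picks up server-processed URLs such as the CGI request, which a filesystem walk would not reveal.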
Expanding OAI-PMH / Complex Object Access
• OAI-PMH / CO access for:
  o blogs
  o message boards
  o native file systems
    - e.g. Mac OS X “Spotlight”
• More aggressive use of OAI-PMH / CO for preservation
  o recently funded NSF DIGARCH program
  o use for preservation:
    - Usenet
    - Email
    - Multicasting
OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting
• Better web harvesting can be achieved through:
  o OAI-PMH: structured access to updates
  o Complex object formats: modeled representation of digital objects
• Use cases:
  o Preservation (ListRecords)
  o Web crawling (ListIdentifiers)
• mod_oai: reference implementation
  o Better performance than wget
  o static files only; dynamic files in the future
  o not a replacement for DSpace, Fedora, eprints.org, etc.
• More info:
  o http://www.modoai.org/
  o http://whiskey.cs.odu.edu/