An Apache Module for Generating Self-Describing Web Resources Joan A. Smith Michael L. Nelson Alliance for Information Science and Technology Innovation East Coast Meeting 16 October 2007

An Apache Module for Generating Self-Describing Web Resources

Joan A. Smith

Michael L. Nelson

Alliance for Information Science and Technology Innovation

East Coast Meeting

16 October 2007


16 October 2007 {jsmit,mln}@cs.odu.edu Slide # 2

Web Site Preservation: 2 Problems

The counting problem: How many pages are on that site?

To save it you have to find it

The representation problem: What’s that page all about?

Future use requires understanding

Guess the bean count, win the jar


A Crawler’s View of the Web Site

web root: http://www.foo.edu/

[Figure: web site tree below the web root; X marks appear on parts of the tree]

The crawler has run into the counting problem, and doesn’t know it….


Pages Out of Crawler Reach

• Some pages linked from web root
• Some dynamic content
• Some orphaned pages
• Some pages protected with access controls
• Some pages too deep for a particular crawler

Sitemap protocol attempts to address these problems


Sitemap Protocol

• Google-driven initiative
  – Derives from earlier concept of graphical “site map” and alphabetical “site index”
  – Google standardized the protocol
• XML-formatted file
  – Created by webmaster of a web site
  – Can be “tweaked” to include/exclude files based on a wide variety of criteria
• Simplifies site resource exposure to search engines (i.e., to their robots)
• Supported by Google, MSN, Yahoo and ASK

But it doesn’t address inherent HTTP limitations


Web Crawling & The Counting Problem

Counting Problem

• HTTP cannot ask for only new or modified resources
  – Conditional GET by datestamp or etag has limited benefit
  – Cannot get a list of pages that have been deleted, changed, or added
  – Each resource must be requested, one at a time, by name
• There is no “SELECT *” in HTTP
  – Crawlers cannot request a list of all URLs for the site
  – Crawlers can only GET one resource at a time, by name
  – HTTP cannot give a crawler a list of resources it has

Undiscovered resources will not be refreshed

Search Engine Solution

• Sitemaps
  – XML document lays out site structure (cf. http://www.sitemaps.org/protocol.php)

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2005-01-01</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>

  – Provides minimal, crawl-oriented metadata (update frequency, etc.)
  – Can include dynamic URLs
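As an illustration of how little a crawler has to do to consume a Sitemap, here is a minimal Python sketch that reads the example document from this slide and lists its entries. The function name `list_sitemap_entries` is an assumption made for this example; it is not part of any Sitemap tooling or of mod_oai.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemap 0.9 schema, as shown in the slide's example.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# The example sitemap from the slide, reproduced as a string.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""


def list_sitemap_entries(xml_text):
    """Return one dict per <url> element, keyed by child tag name."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            value = url.findtext("sm:" + field, namespaces=SITEMAP_NS)
            if value is not None:
                entry[field] = value.strip()
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in list_sitemap_entries(SITEMAP_XML):
        print(entry["loc"], entry.get("lastmod"))
```

The result carries only crawl-oriented fields (location, last modification, change frequency, priority), which is the limitation the following slides pick up.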


A Web Page: Behind the Scenes


HTTP: Behind the Scenes

Resource example: http://foo.edu/jackJill.jpg

• Note the limited metadata from the HTTP GET request
• Binary content is not human-readable
• We only “GET” one resource at a time

Additional metadata could help the digital image archeologist of the future:

– Color map
– NISO information
– Base64 encoding of resource
– MD5 or other hash function
– Subject matter

And metadata that could help preserve the Jack and Jill document content:

– Language
– Script type and version
– Document summary/abstract
– Keyword extraction
– Lexical signature

% telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.

GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà"#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ [binary JPEG data continues, not human-readable]

Connection closed by foreign host.
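To make the "limited metadata" point concrete, the sketch below parses the response head from the telnet session above and shows everything HTTP says about the image. It uses only the Python standard library; `parse_response_head` is a name invented for this example.

```python
import http.client
import io

# Status line and headers copied from the telnet session on this slide.
RAW_RESPONSE_HEAD = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Mon, 11 Jun 2007 16:49:25 GMT\r\n"
    b"Server: Apache/1.3.33 (Unix)\r\n"
    b"Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT\r\n"
    b'ETag: "5800535-3e72-4312f924"\r\n'
    b"Accept-Ranges: bytes\r\n"
    b"Content-Length: 15986\r\n"
    b"Content-Type: image/jpeg\r\n"
    b"\r\n"
)


def parse_response_head(raw):
    """Split a raw response head into (status code, reason, header dict)."""
    stream = io.BytesIO(raw)
    status_line = stream.readline().decode("iso-8859-1").strip()
    _version, status, reason = status_line.split(" ", 2)
    headers = http.client.parse_headers(stream)  # reads up to the blank line
    return int(status), reason, dict(headers.items())


if __name__ == "__main__":
    status, reason, headers = parse_response_head(RAW_RESPONSE_HEAD)
    print(status, reason)
    for name, value in headers.items():
        print(name + ":", value)
```

Seven headers come back, and none of them describes the image itself beyond its MIME type and length.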


Web Crawling & The Representation Problem

• HTTP provides limited metadata
  – Server-client communication, focused on the here and now
  – Not concerned with preservation-related information
  – Format obsolescence is not addressed
  – File content comprehension is the client’s problem
• Client and Server must be configured to handle each resource “type”
  – Default includes most common current file types
  – Older types, unusual files may get an error message from the client browser


Archives: Metadata-Rich

• Each model handles resource representation information in its own way
• Metadata helps ensure long-term persistence, availability of resources


The MPEG-21 DIDL Model

• A complex-object model combining the resource and its metadata together


Post-Harvest Processing required for ingestion

Harvest → Analyze/Examine/Process → Archive

Often a combination of manual and automated input


Metadata Generation Utility Examples

Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
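Two of the utilities in the table are simple enough to sketch directly. The Python below is a simplified stand-in, not the actual MD5 or File Magic programs: it computes an MD5 digest with the standard library and guesses a MIME type from the leading "magic" bytes. The three signatures listed are well-known file prefixes; the real `file` utility consults a far larger database. The names `md5_hex` and `sniff_mime` are assumptions made for this example.

```python
import hashlib

# A tiny, illustrative signature table: (leading bytes, MIME type).
MAGIC_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
]


def md5_hex(data):
    """Return the MD5 message digest of the bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def sniff_mime(data):
    """Guess a MIME type from the first bytes of the resource."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return "application/octet-stream"  # fallback when nothing matches


if __name__ == "__main__":
    sample = b"\xff\xd8\xff\xe0" + b"\x00" * 16  # fake JPEG-like bytes
    print(sniff_mime(sample), md5_hex(sample))
```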


# Webs >> # Archiving Institutions

[Figure: Typical ingest scenario; labels: Web Sites, Archivist]


Harvest with Metadata

Metadata Magic: Get the resource together with its metadata

Harvest Pre-processed resource


Harnessing the Web Server

Archivist: mod_oai GetRecord request and response

User: standard GET request and response

Self-describing resource


Configuring the Web-Server for Metadata Magic

http://foo.edu/example.html

• No impact to everyday users

• Regular “GET” => “regular” response

• OAI-PMH “GetRecord” => “crate” response

http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/example.html&metadataPrefix=crate

• Standard Apache “Location” directive

• mod_oai module configured with “plug-ins”

• Scripts, utilities, etc. can vary by MIME type

<Location /modoai>
    SetHandler modoai-handler
    modoai_plugin "jhove" "/opt/jhove/jhove -m jpeg-hul %s" "/opt/jhove/jhove --v" "image/jpeg"
    modoai_plugin "ots" "/usr/local/bin/ots -summary %s" "/usr/local/bin/ots -v" "text/*"
    modoai_plugin "jhove" "/opt/jhove/jhove -m pdf-hul %s" "/opt/jhove/jhove --v" "application/pdf"
    modoai_plugin "pronom" "java -jar DROID.jar -L%s" "java -jar DROID.jar -v" "*/*"
</Location>
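The MIME-based selection in this configuration can be sketched in a few lines of Python. This illustrates the idea only and is not mod_oai's code: treating the MIME patterns as shell-style globs is an assumption, and `plugins_for` and `build_command` are names invented for this example.

```python
import shlex
from fnmatch import fnmatchcase

# (label, MIME pattern) pairs, in the order they appear in the Location block.
PLUGIN_PATTERNS = [
    ("jhove", "image/jpeg"),
    ("ots", "text/*"),
    ("jhove", "application/pdf"),
    ("pronom", "*/*"),
]


def plugins_for(mime_type):
    """Return the labels of all plugins whose pattern matches the MIME type."""
    return [
        label
        for label, pattern in PLUGIN_PATTERNS
        if fnmatchcase(mime_type.lower(), pattern)
    ]


def build_command(template, resource_path):
    """Substitute the resource path for %s, quoting it for a shell."""
    return template.replace("%s", shlex.quote(resource_path))


if __name__ == "__main__":
    print(plugins_for("image/jpeg"))   # jhove and pronom apply
    print(plugins_for("text/html"))    # ots and pronom apply
    # /var/www/jackJill.jpg is a hypothetical filesystem path.
    print(build_command("/opt/jhove/jhove -m jpeg-hul %s", "/var/www/jackJill.jpg"))
```

Because the "*/*" pattern matches everything, the "pronom" plugin runs on every resource, while the others run only on their own types.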


6 Verbs of the OAI-PMH

Verb                  Function

Metadata about the repository:
Identify              description of repository
ListMetadataFormats   metadata formats supported by repository
ListSets              sets defined by repository

Harvesting verbs:
ListIdentifiers       OAI unique ids contained in repository
ListRecords           listing of N records
GetRecord             listing of a single record

Most verbs can take qualifying arguments: dates, sets, ids, metadata formats, and resumption token (for flow control)

• Compatible with HTTP
• Supports OAIS model
• Can support complex object model

OAI-PMH can help resolve the Counting and Representation Problems
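Because OAI-PMH requests are ordinary HTTP GETs with a `verb` parameter plus qualifying arguments, building them takes little code. The Python sketch below shows one way to do it. The helper name `oai_request_url` is an assumption made for this example, and the base URL is the hypothetical foo.edu endpoint used throughout these slides.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH.
VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, arguments=None):
    """Build a request URL from a base URL, a verb, and optional arguments."""
    if verb not in VERBS:
        raise ValueError("not an OAI-PMH verb: " + verb)
    query = [("verb", verb)]
    # A dict is used for the arguments because "from" is a Python keyword.
    query.extend((arguments or {}).items())
    return base_url + "?" + urlencode(query)


if __name__ == "__main__":
    base = "http://www.foo.edu/modoai/"
    print(oai_request_url(base, "Identify"))
    print(oai_request_url(base, "ListIdentifiers", {
        "metadataPrefix": "oai_dc",
        "from": "2006-09-15",
        "set": "mime:video:mpeg",
    }))
```

`urlencode` percent-encodes reserved characters, so a set name such as `mime:video:mpeg` is sent as `mime%3Avideo%3Ampeg`, which a server decodes back to the same value.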


OAI-PMH Verbs and mod_oai: Addressing The Counting & Representation Problems

• Counting Problem
  – “ListIdentifiers” provides equivalent of Sitemap
  – “ListRecords” response serializes the site’s contents using a single request
  – Qualifiers (by date range, by MIME set) enable customized crawls
  – Simplifies update semantics
• Representation Problem
  – “ListRecords” and “GetRecord” responses can include a wealth of metadata
    • MPEG-21 DIDL
    • CRATE
  – Allows more sophisticated crawling and archival preparation

OAI-PMH verbs address inherent HTTP limitations

mod_oai lets the web server provide self-describing resources
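To show how “ListIdentifiers” answers the counting problem from the crawler's side, here is a small Python sketch that extracts identifiers from a response. The sample XML is hypothetical and written for this example: it follows the OAI-PMH 2.0 response layout, but its URLs and datestamps are illustrative only. `list_identifiers` is likewise an invented name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListIdentifiers response for a two-resource site.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc">http://www.foo.edu/modoai/</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/index.html</identifier>
      <datestamp>2007-01-10T12:00:00Z</datestamp>
      <setSpec>mime:text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/jackJill.jpg</identifier>
      <datestamp>2007-01-17T04:09:07Z</datestamp>
      <setSpec>mime:image:jpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def list_identifiers(xml_text, set_spec=None):
    """Return identifiers from the response, optionally limited to one set."""
    root = ET.fromstring(xml_text)
    found = []
    for header in root.findall("oai:ListIdentifiers/oai:header", OAI_NS):
        specs = [s.text for s in header.findall("oai:setSpec", OAI_NS)]
        if set_spec is not None and set_spec not in specs:
            continue
        found.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return found


if __name__ == "__main__":
    print(list_identifiers(SAMPLE_RESPONSE))
    print(list_identifiers(SAMPLE_RESPONSE, set_spec="mime:image:jpeg"))
```

The set filter here runs on the client for illustration. Passing `set` in the request leaves the filtering to the server, which is how the MIME-set qualifier is meant to be used.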


What is a “Self-Describing” Resource?

Standard HTTP Headers:
    Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
    ETag: "5800535-3e72-4312f924"
    Content-Length: 15986
    Content-Type: image/jpeg

PLUS: Output from built-in utilities:

EXIF TOOL:
    File Name             : 103_0315.JPG
    Camera Model Name     : Canon EOS DIGITAL REBEL
    Date/Time Original    : 2003:09:30 13:37:51
    Shooting Mode         : Sports
    Shutter Speed         : 1/2000
    Aperture              : 7.1
    Metering Mode         : Evaluative
    Exposure Compensation : 0
    ISO                   : 400
    Lens                  : 75.0 - 300.0mm
    Focal Length          : 300.0mm
    Image Size            : 3072x2048
    Quality               : Normal
    Flash                 : Off
    White Balance         : Auto
    Focus Mode            : AI Servo AF
    Contrast              : +1
    Sharpness             : +1
    Saturation            : +1
    Color Tone            : Normal
    File Size             : 1606 kB
    File Number           : 103-0315

JHOVE TOOL:
    Date: 2007-06-18 14:35:50 EDT
    RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
     ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
     LastModified: 2007-01-16 23:09:07 EST
     Size: 27750
     Format: JPEG
     Version: 1.00
     Status: Well-Formed and valid
     SignatureMatches:
      JPEG-hul
     MIMEtype: image/jpeg
     Profile: JFIF
     JPEGMetadata:
      CompressionType: Huffman coding, Baseline DCT
      Images:
       Number: 1
       Image:
        NisoImageMetadata:
         MIMEType: image/jpeg
         ByteOrder: big-endian
         CompressionScheme: JPEG
         ColorSpace: YCbCr
         SamplingFrequencyUnit: inch
         XSamplingFrequency: 33
         YSamplingFrequency: 26
         ImageWidth: 172
         ImageLength: 146
         BitsPerSample: 8, 8, 8
         SamplesPerPixel: 3
        Scans: 1
        QuantizationTables:
         QuantizationTable:
          Precision: 8-bit
          DestinationIdentifier: 0
      Comments: LEAD Technologies Inc. V1.01
      ApplicationSegments: APP0

File/Magic:
    JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26

MD5 Hash:
    58a54e8638db432f4515eedf89f44505

…CRATE: Wrapped together with the resource in simple XML


Apache: mod_oai Location Directive

<Location /modoai>
    # Apply these rules to http://foo.edu/modoai
    SetHandler modoai-handler
    # Use modoai to process these requests

    # Plugin element: one utility per element, written on a single text line (EOL after the MIME type).
    #   "jhove"                             label, used as a metadata “ID tag”
    #   "/opt/jhove/jhove -m jpeg-hul %s"   the command line or script to call the utility
    #   "/opt/jhove/jhove --v"              reports the version number of the installed utility
    #   "image/jpeg"                        which MIME types should be analyzed (any jpeg)
    modoai_plugin "jhove" "/opt/jhove/jhove -m jpeg-hul %s" "/opt/jhove/jhove --v" "image/jpeg"

    # Open Text Summarizer; “%s” means substitute resource name here.
    # Use on all text (plain, HTML, XML, etc.) resources.
    modoai_plugin "ots" "/usr/local/bin/ots -summary %s" "/usr/local/bin/ots -v" "text/*"

    # Another invocation of the JHOVE utility; note the different hul used here.
    # Use on all PDF resources (only).
    modoai_plugin "jhove" "/opt/jhove/jhove -m pdf-hul %s" "/opt/jhove/jhove --v" "application/pdf"

    # The PRONOM DROID tool; the second command reports the version.
    # Use this utility on every resource.
    modoai_plugin "pronom" "java -jar DROID.jar -L%s" "java -jar DROID.jar -v" "*/*"
</Location>

• Scripts
• Pipes
• Executables
• MIME-based selective processing


Building a CRATE

• URI, UUID

• Standard HTTP Headers

• Plug-In Metadata

• Base64-Encoded Resource

CRATE

CRATE ID

METADATA

RESOURCE

A simple model for self-describing web resources
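The three CRATE parts listed on this slide (ID, metadata, resource) can be assembled with a short Python sketch. This is not the draft CRATE schema: `crateContent`, `crateMetadata`, `description`, `label`, `mimeType` and `data` mirror the GetRecord example in this deck, while the root `crate` element, `crateId`, `uri`, `uuid` and `httpHeaders` are hypothetical names chosen for this example, as is the use of a name-based UUID.

```python
import base64
import uuid
import xml.etree.ElementTree as ET


def build_crate(uri, resource_bytes, mime_type, http_headers, plugin_outputs):
    """Wrap a resource and its metadata in one XML element tree."""
    crate = ET.Element("crate")  # hypothetical root element name

    # CRATE ID: the resource URI plus a UUID (derived from the URI here).
    crate_id = ET.SubElement(crate, "crateId")
    ET.SubElement(crate_id, "uri").text = uri
    ET.SubElement(crate_id, "uuid").text = str(uuid.uuid5(uuid.NAMESPACE_URL, uri))

    # METADATA: standard HTTP headers, then one description per plug-in.
    metadata = ET.SubElement(crate, "crateMetadata")
    headers = ET.SubElement(metadata, "httpHeaders")
    for name, value in http_headers.items():
        ET.SubElement(headers, "header", name=name).text = value
    for label, output in plugin_outputs.items():
        description = ET.SubElement(metadata, "description")
        ET.SubElement(description, "label").text = label
        ET.SubElement(description, "data").text = output

    # RESOURCE: the bytes themselves, Base64-encoded so they survive in XML.
    content = ET.SubElement(crate, "crateContent")
    ET.SubElement(content, "mimeType").text = mime_type
    ET.SubElement(content, "data").text = base64.b64encode(resource_bytes).decode("ascii")
    return crate


if __name__ == "__main__":
    crate = build_crate(
        "http://foo.edu/jackJill.jpg",
        b"\xff\xd8\xff\xe0",  # stand-in bytes, not a real image
        "image/jpeg",
        {"Content-Type": "image/jpeg"},
        {"file magic": "JPEG image data"},
    )
    print(ET.tostring(crate, encoding="unicode"))
```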


A self-describing resource using mod_oai

http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding="base64"</mimeType>
        <data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>"jhove"</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data>
            Date: 2007-06-18 14:35:50 EDT
            RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
            ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
            LastModified: 2007-01-16 23:09:07 EST
            Size: 27750
            Format: JPEG
            Version: 1.00
            Status: Well-Formed and valid
            SignatureMatches: JPEG-hul
            MIMEtype: image/jpeg
            Profile: JFIF
            JPEGMetadata:
            CompressionType: Huffman coding, Baseline DCT
            Images:
            Number: 1
            Image:
            NisoImageMetadata:
            MIMEType: image/jpeg
            ByteOrder: big-endian
            CompressionScheme: JPEG
            ColorSpace: YCbCr
            SamplingFrequencyUnit: inch
            XSamplingFrequency: 33
            YSamplingFrequency: 26
            ImageWidth: 172
            ImageLength: 146
            BitsPerSample: 8, 8, 8
            SamplesPerPixel: 3
            Scans: 1
            QuantizationTables:
            QuantizationTable:
            Precision: 8-bit
            DestinationIdentifier: 0
            Comments: LEAD Technologies Inc. V1.01
            ApplicationSegments: APP0
          </data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
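On the harvesting side, an archivist's client needs to pull the resource and the plug-in metadata back out of such a response. The Python sketch below does this for a shortened, hypothetical response that uses the same element names as the slide; the Base64 payload is four stand-in bytes rather than a real image. `unpack_crate` is a name invented for this example.

```python
import base64
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Shortened, hypothetical GetRecord response in the layout shown on the slide.
SAMPLE_CRATE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding="base64"</mimeType>
        <data>/9j/4A==</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <data>JPEG image data, JFIF standard 1.00</data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def unpack_crate(xml_text):
    """Return (identifier, MIME type, resource bytes, {label: plug-in output})."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", OAI_NS)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)

    content = record.find("oai:crateContent", OAI_NS)
    mime_field = content.findtext("oai:mimeType", default="", namespaces=OAI_NS)
    mime_type = mime_field.split()[0] if mime_field.split() else ""
    resource = base64.b64decode(content.findtext("oai:data", namespaces=OAI_NS))

    metadata = {}
    for description in record.findall("oai:crateMetadata/oai:description", OAI_NS):
        label = description.findtext("oai:label", default="", namespaces=OAI_NS)
        output = description.findtext("oai:data", default="", namespaces=OAI_NS)
        # Labels are quoted in the slide's example, so strip the quotes.
        metadata[label.strip().strip('"')] = output.strip()
    return identifier, mime_type, resource, metadata


if __name__ == "__main__":
    identifier, mime_type, resource, metadata = unpack_crate(SAMPLE_CRATE_RESPONSE)
    print(identifier, mime_type, len(resource), sorted(metadata))
```

Keying the metadata by label is a simplification: a record that carries two descriptions with the same label would keep only the last one.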


Automatic, Best-Effort Metadata

• Automatic
  – Generated at time of dissemination
  – Integrates preservation functions with the web server
• Unverified
  – Utility results are not cross-checked
  – Output of analyses goes directly into XML response
• Undifferentiated
  – No categorization of output
  – Resource and metadata form complex-object response

A simple, easy-to-implement option for improving available preservation metadata for web resources


Preservation & Metadata

[Figure: Probability of Preservation (Low to High) plotted against Resource Metadata Available (Less to More). Labeled items, in the order shown: HTTP/HTML; Automatic metadata utilities/CRATE; Archival Information Package (AIP)]


Current Status

• mod_oai Open Source release at project’s completion
• Draft CRATE schema definition (XSD)
• Metrics Collection & Evaluation
  – Impact of utilities on web server performance
  – Examine utility compatibility and issues
  – Address security concerns
• Native Utility Efficiency
  – Language dependent (Java, C)
  – Improvements may depend on external pressure
• Security
  – Metadata vs information exposure risk
  – Access controls


Demo

AT MODOAI.ORG:

http://www.modoai.org/demos.html


Further Information

• The mod_oai project home page: http://www.modoai.org/
• JCDL 2007: Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
• IWAW 2007: CRATE: A Simple Model For Self-Describing Web Resources
• Authors’ webs:
  – http://www.cs.odu.edu/~mln/pubs/
  – http://www.joanasmith.com/pubs.html


Supplementary Slides


Robot Crawls of A Large, Deep Web

Google example here: http://www.joanasmith.com/deepWeb/animOdu2.gif


Addressing the Counting Problem Using OAI-PMH via mod_oai

Advantages for Crawler:
• Single request itemizes all resources in web tree: ListIdentifiers
• Can refine by MIME set, Datestamp

Original modoai was limited:
• No Dynamic URLs
• Web root tree only
• Same metadata as HTTP

Basic request: http://www.foo.edu/modoai/?verb=ListIdentifiers&metadataPrefix=oai_dc

Enhanced request: &from=2006-09-15&set=mime:video:mpeg

New version:
• Utilizes sitemap files
• Can include Dynamic URLs
• Rich metadata possibilities via “plugins”


Web Server Configuration: “conf” file

### Section 1: Global Environment
#
ServerType standalone
ServerRoot "/etc/httpd"
PidFile /var/run/httpd.pid
ResourceConfig /dev/null
AccessConfig /dev/null
Timeout 300
KeepAlive On
MaxKeepAliveRequests 0
KeepAliveTimeout 15
MinSpareServers 16
MaxSpareServers 64
StartServers 16
MaxClients 512
MaxRequestsPerChild 100000

### Section 2: 'Main' server configuration

# Port 80

<IfDefine SSL>
    Listen 80
    Listen 443
</IfDefine>

User www
Group www
ServerAdmin [email protected]
ServerName www.openna.com
DocumentRoot "/home/httpd/ona"

<Directory />
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
</Directory>

<Directory "/home/httpd/ona">
    Options None
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

<Files .pl>
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
</Files>

<IfModule mod_dir.c>
    DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>

#<IfModule mod_include.c>
#Include conf/mmap.conf
#</IfModule>

UseCanonicalName On

<IfModule mod_mime.c>
    TypesConfig /etc/httpd/conf/mime.types
</IfModule>

DefaultType text/plain
HostnameLookups Off

• Operational Rules
• Modules (mod_perl, etc.)
• Security
• Virtual Hosts