Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination
Joan A. Smith & Michael L. Nelson
Old Dominion University
Department of Computer Science
Norfolk, VA 23529
{jsmit, mln}@cs.odu.edu
JCDL 2007
Presented: 20 June 2007
Joint Conference on Digital Libraries 2007
20 June 2007 {jas,mln}@odu.edu Slide # 2
What’s In A Web Page?
A Simple Web Page: Behind the Scenes
HTTP: Behind the Scenes
Non-Text Resource example: http://foo.edu/jackJill.jpg
• Note the sparse metadata from the HTTP GET request
• Binary content is not human-readable and does not even display properly in the terminal window
We really need more metadata for the digital archeologist of the future:
– Color map
– NISO information
– Base64 encoding of resource
– MD5 or other hash function
– Subject matter
And more metadata would help preserve the Jack and Jill document, too:
– Language
– Document summary/abstract
– Keyword extraction
– Lexical signature
% telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg
ÿØÿà"#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿĬê@‘XÑ9÷'M½ÂšX¬4ýÃÆ{çÉÎ Ð?~‰·õÔÓ!� �RÓ@Š’û¡·TÓ`r’pz{ ëÖ.éhéQ)Ùè5ü b»[g¨øx^zè ²�"#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿĬê@‘XÑ9÷'M½ÂšX¬4ýÃÆ{çÉÎ Ð?~‰·õÔÓ!� �RÓ@Š’û¡·TÓ`r’pz{ ëÖ.éhéQ)Ùè5ü b»[g¨øx^zè�
Connection closed by foreign host.
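Two of the items on the wish list above, a hash and a Base64 encoding of the resource, can be derived from the bytes alone. The following is a minimal Python sketch of that idea, not the authors' implementation; the function name and the dictionary keys are assumptions made for illustration.

```python
import base64
import hashlib


def basic_preservation_metadata(data: bytes) -> dict:
    """Derive a few content-based fields from the raw bytes of a resource.

    The field names are hypothetical; they are not taken from the slides.
    """
    return {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "base64": base64.b64encode(data).decode("ascii"),
    }


if __name__ == "__main__":
    # Stand-in bytes; a real harvest would pass the body of the HTTP response.
    sample = b"\xff\xd8\xff\xe0 stand-in for JPEG bytes"
    for key, value in basic_preservation_metadata(sample).items():
        print(f"{key}: {value}")
```

The remaining items (color map, NISO information, subject matter) need format-aware analysis, which is where utilities such as those on the later slides come in.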
Preservation & Metadata
[Figure: probability of preservation (y-axis, Low to High) versus resource metadata available (x-axis, Less to More). Labeled points: "What I get from the HTTP/HTML" and "What I need to make an Archival Information Package (AIP)".]
Post-Harvest Processing (at Ingest)
Harvest -> Analyze/Examine/Process -> Archive
Often a combination of manual and automated input
Metadata Generation Utility Examples
Name         Description
Jhove        Analysis by type (image, audio, text)
Kea          Key phrase extraction
OTS          Open Text Summarizer
ExifTool     Image/video metadata extractor
PDFlib-pCOS  Extract PDF metadata
MP3-Tag      Extract audio file tags
Essence      Customized information extraction
GDFR         MIME++
MD5          Message Digest
File Magic   Uses content-identification bits of the file
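To make the "content-identification bits" idea concrete, here is a toy Python sketch that guesses a MIME type from the leading bytes of a file. It is a stand-in for illustration only, not the File Magic utility itself; the signature table is a small hand-picked subset and the function name is an assumption.

```python
# Hypothetical signature table: a tiny, hand-picked subset for illustration.
# Real tools such as file(1) consult a far larger database.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime_type(data: bytes, default: str = "application/octet-stream") -> str:
    """Return the MIME type whose signature matches the start of the data."""
    for prefix, mime_type in SIGNATURES:
        if data.startswith(prefix):
            return mime_type
    return default


if __name__ == "__main__":
    print(sniff_mime_type(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
    print(sniff_mime_type(b"%PDF-1.4"))
    print(sniff_mime_type(b"plain text"))
```

Identifying the type from content rather than from the file name or the Content-Type header matters for preservation, because the latter two can be wrong or missing.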
The Conscientious Webmaster
He who waits to do a great deal of good will never do anything. -- Samuel Johnson
Preservation is important…
But I’m soooo busy…
How to help???
Configuring the Web-Server for Automatic Metadata
http://foo.edu/example.html
• No impact to everyday users
• Regular “GET” => “regular” response
• OAI-PMH “GetRecord” => “crate” response
http://foo.edu/modoai/?verb=getRecord&identifier=http://foo.edu/example.html&metadataPrefix=crate
• Standard Apache “Location” directive
• mod_oai module configured with “plug-ins”
• Scripts, utilities, etc. can vary by MIME type
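The last bullet, plug-ins that vary by MIME type, amounts to a lookup from the MIME type of a resource to the utilities that should run on it. The Python sketch below illustrates that dispatch under stated assumptions: it is not mod_oai's configuration syntax, the table layout and function name are hypothetical, and only the two command lines that appear elsewhere in this deck are used.

```python
from fnmatch import fnmatchcase

# Hypothetical plug-in table: (MIME pattern, (label, command template)).
# The two commands are the ones shown in the sample response in this deck;
# the "{path}" placeholder is an assumption of this sketch.
PLUGINS = [
    ("*/*", ("file magic", "/usr/bin/file {path}")),
    ("image/jpeg", ("jhove", "/opt/jhove/jhove -m jpeg-hul {path}")),
]


def plugins_for(mime_type):
    """Return the (label, command template) pairs that apply to a MIME type."""
    return [plugin for pattern, plugin in PLUGINS if fnmatchcase(mime_type, pattern)]


if __name__ == "__main__":
    for label, template in plugins_for("image/jpeg"):
        print(label, "->", template.format(path="jackJill.jpg"))
```

Keeping a catch-all pattern alongside type-specific ones means every resource gets at least some metadata, while richer utilities run only where they apply.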
Harvest with Metadata (at Dissemination)
Metadata Magic: Get the resource together with its metadata
Harvest Pre-processed resource
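From the harvester's side, getting the resource together with its metadata is a single OAI-PMH request. The Python sketch below builds such a request URL. Two points are assumptions of this sketch rather than statements from the slides: the verb is spelled "GetRecord" as in the OAI-PMH specification (the URLs on the slides write "getRecord"), and the identifier is percent-encoded, which the slides do not show. The function name is hypothetical.

```python
from urllib.parse import urlencode


def crate_request_url(base_url: str, resource_url: str) -> str:
    """Build a GetRecord request asking for the 'crate' metadata format."""
    query = urlencode({
        "verb": "GetRecord",          # OAI-PMH spelling; an assumption here
        "identifier": resource_url,   # percent-encoded by urlencode
        "metadataPrefix": "crate",
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(crate_request_url("http://foo.edu/modoai/", "http://foo.edu/jackJill.jpg"))
```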
Automatic Metadata via mod_oai
http://foo.edu/modoai/?verb=getRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg"
           metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding="base64"</mimeType>
        <data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>"jhove"</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data>
            Date: 2007-06-18 14:35:50 EDT
            RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
            ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
            LastModified: 2007-01-16 23:09:07 EST
            Size: 27750
            Format: JPEG
            Version: 1.00
            Status: Well-Formed and valid
            SignatureMatches: JPEG-hul
            MIMEtype: image/jpeg
            Profile: JFIF
            JPEGMetadata:
            CompressionType: Huffman coding, Baseline DCT
            Images:
            Number: 1
            Image:
            NisoImageMetadata:
            MIMEType: image/jpeg
            ByteOrder: big-endian
            CompressionScheme: JPEG
            ColorSpace: YCbCr
            SamplingFrequencyUnit: inch
            XSamplingFrequency: 33
            YSamplingFrequency: 26
            ImageWidth: 172
            ImageLength: 146
            BitsPerSample: 8, 8, 8
            SamplesPerPixel: 3
            Scans: 1
            QuantizationTables:
            QuantizationTable:
            Precision: 8-bit
            DestinationIdentifier: 0
            Comments: LEAD Technologies Inc. V1.01
            ApplicationSegments: APP0
          </data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
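A harvester receiving a response like the one above has to separate the Base64-encoded resource from the utility output. The Python sketch below shows one way to do that with the standard library. It assumes the crate elements live in the OAI-PMH default namespace, as they appear to on the slide; a real crate response may differ. The function name, the returned keys, and the shortened sample document are all hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Shortened, hypothetical sample modeled on the slide; not real mod_oai output.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record>
    <header><identifier>http://foo.edu/jackJill.jpg</identifier></header>
    <crateContent><mimeType>image/jpeg</mimeType><data>aGVsbG8=</data></crateContent>
    <crateMetadata>
      <description><label>"file magic"</label><data>JPEG image data</data></description>
    </crateMetadata>
  </record></GetRecord>
</OAI-PMH>"""


def parse_crate_record(xml_text):
    """Split a crate response into identifier, resource bytes, and metadata."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", NS)
    if record is None:
        raise ValueError("no GetRecord/record element in response")
    encoded = record.findtext("oai:crateContent/oai:data", default="", namespaces=NS)
    metadata = {}
    for desc in record.findall("oai:crateMetadata/oai:description", NS):
        # Labels on the slide carry literal quotes; strip them for use as keys.
        label = desc.findtext("oai:label", default="", namespaces=NS).strip().strip('"')
        metadata[label] = desc.findtext("oai:data", default="", namespaces=NS).strip()
    return {
        "identifier": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
        "resource": base64.b64decode(encoded),
        "metadata": metadata,
    }


if __name__ == "__main__":
    print(parse_crate_record(SAMPLE))
```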
Preservation & Metadata
[Figure: probability of preservation (y-axis, Low to High) versus resource metadata available (x-axis, Less to More). Labeled points: HTTP/HTML; Automatic metadata utilities/CRATE; Archival Information Package (AIP).]
Automatic, Best-Effort Metadata
• Unverified
  – Utility results are not cross-checked
  – Analysis output goes directly into the XML response
• Undifferentiated
  – No categorization of output
  – Resource and metadata cohabit the response
• Automatic
  – Generated at time of dissemination
  – Integrates preservation functions with the web server
A simple, easy-to-implement option for improving preservation metadata for web resources
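The "unverified" and "undifferentiated" properties can be read as a simple rule: whatever each utility prints is copied as-is into the response, and a utility that produces nothing is left out. The Python sketch below illustrates that rule. It is an assumption-laden illustration, not mod_oai's code; the element names follow the sample response in this deck, while the function name and the input dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_crate_metadata(results):
    """Wrap raw utility output in <description> elements, best effort.

    `results` is a list of dicts with the keys label, exec, version, data
    (a hypothetical layout). Entries whose data is None are skipped; nothing
    is cross-checked or categorized.
    """
    root = ET.Element("crateMetadata")
    for result in results:
        if result.get("data") is None:
            continue  # best effort: a utility that produced nothing is omitted
        desc = ET.SubElement(root, "description")
        for field in ("label", "exec", "version", "data"):
            ET.SubElement(desc, field).text = str(result.get(field, ""))
    # ElementTree escapes &, < and > in text, so raw output stays well-formed.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_crate_metadata([
        {"label": "file magic", "exec": "/usr/bin/file jackJill.jpg",
         "version": "file-4.16", "data": "JPEG image data, JFIF standard 1.00"},
        {"label": "jhove", "exec": "/opt/jhove/jhove -m jpeg-hul",
         "version": "Jhove (Rel. 1.1, 2006-06-05)", "data": None},
    ]))
```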
Further Information
• The mod_oai project home page: http://www.modoai.org/
• IWAW 2007: “CRATE: A Simple Model for Self-Describing Web Resources”
• Authors’ web pages:
  • http://www.cs.odu.edu/~mln/pubs/
  • http://www.joanasmith.com/pubs.html
I Helped!