Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination Joan A. Smith & Michael L. Nelson Old Dominion University Department of Computer Science Norfolk, VA 23529 {jsmit, mln}@cs.odu.edu JCDL 2007 Presented: 20 June 2007 Joint Conference on Digital Libraries 2007


Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination

Joan A. Smith & Michael L. Nelson
Old Dominion University

Department of Computer Science
Norfolk, VA 23529

{jsmit, mln}@cs.odu.edu

JCDL 2007
Presented: 20 June 2007

Joint Conference on Digital Libraries 2007


20 June 2007 {jas,mln}@odu.edu Slide # 2

What’s In A Web Page?


A Simple Web Page: Behind the Scenes


HTTP: Behind the Scenes

Non-Text Resource example: http://foo.edu/jackJill.jpg

• Note the sparse metadata from the HTTP GET request
• Binary content is not human-readable and does not even display properly in the terminal window

We really need more metadata for the digital archeologist of the future:

– Color map
– NISO information
– Base64 encoding of resource
– MD5 or other hash function
– Subject matter

And more metadata would help preserve the Jack and Jill document, too:

– Language
– Document summary/abstract
– Keyword extraction
– Lexical signature

% telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.

GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà"#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿĬê@‘XÑ9÷'M½ÂšX¬4ýÃÆ{çÉÎ Ð?~‰·õÔÓ!� �RÓ@Š’û¡·TÓ`r’pz{ ëÖ.éhéQ)Ùè5ü b»[g¨øx^zè ²�"#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿĬê@‘XÑ9÷'M½ÂšX¬4ýÃÆ{çÉÎ Ð?~‰·õÔÓ!� �RÓ@Š’û¡·TÓ`r’pz{ ëÖ.éhéQ)Ùè5ü b»[g¨øx^zè�

Connection closed by foreign host.
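Two items on the wish list above, an MD5 digest and a Base64 encoding of the resource, can be derived from the raw bytes alone. The following is a minimal Python sketch under that assumption; the function name `fingerprint` and the stand-in bytes are illustrative and do not come from the presentation.

```python
import base64
import hashlib


def fingerprint(resource: bytes) -> dict:
    """Return an MD5 digest and a Base64 encoding for a resource's raw bytes."""
    return {
        "md5": hashlib.md5(resource).hexdigest(),
        "base64": base64.b64encode(resource).decode("ascii"),
        "length": len(resource),
    }


if __name__ == "__main__":
    # Stand-in bytes (hypothetical); a real run would read the file itself,
    # for example the jackJill.jpg resource fetched in the session above.
    sample = b"\xff\xd8\xff\xe0" + b"not a real image"
    print(fingerprint(sample))
```

Unlike the binary body in the terminal session, both values are plain ASCII, so they can be embedded safely in a text or XML response.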


Preservation & Metadata

[Figure: Probability of Preservation (vertical axis, Low to High) against Resource Metadata Available (horizontal axis, Less to More). Labels: “What I get from the HTTP/HTML”; “What I need to make an Archival Information Package (AIP)”.]


Post-Harvest Processing (at Ingest)

Harvest => Analyze/Examine/Process => Archive

Often a combination of manual and automated input


Metadata Generation Utility Examples

Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
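The "File Magic" entry refers to identifying a format from signature bytes in the content itself rather than from the file name. The sketch below illustrates the idea in Python with a handful of well-known leading-byte signatures. It is only an illustration: the real `file` utility consults a far larger signature database, and the names `SIGNATURES` and `sniff_mime` are hypothetical.

```python
# Leading-byte signatures for a few common formats (well-known magic numbers).
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_mime(head: bytes) -> str:
    """Guess a MIME type from the first bytes of a resource."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    # Fall back to the generic binary type when no signature matches.
    return "application/octet-stream"


if __name__ == "__main__":
    print(sniff_mime(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
    print(sniff_mime(b"%PDF-1.4\n"))
```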


The Conscientious Webmaster

He who waits to do a great deal of good will never do anything. -- Samuel Johnson

Preservation is important…

But I’m soooo busy…

How to help???


Configuring the Web-Server for Automatic Metadata

http://foo.edu/example.html

• No impact to everyday users

• Regular “GET” => “regular” response

• OAI-PMH “GetRecord” => “crate” response

http://foo.edu/modoai/?verb=getRecord&identifier=http://foo.edu/example.html&metadataPrefix=crate

• Standard Apache “Location” directive

• mod_oai module configured with “plug-ins”

• Scripts, utilities, etc. can vary by MIME type
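The request URL on this slide is an ordinary OAI-PMH query string, so a harvester can assemble it mechanically. The Python sketch below makes two assumptions that differ from the slide: it percent-encodes the identifier, and it spells the verb `GetRecord` as the OAI-PMH specification does, whereas the slide's URL writes `getRecord`. The helper name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord request URL for one web resource."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,  # percent-encoded by urlencode
        "metadataPrefix": prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(get_record_url("http://foo.edu/modoai/", "http://foo.edu/example.html"))
```

A plain GET on `http://foo.edu/example.html` is untouched by any of this; only requests sent to the mod_oai base URL ask for the crate response.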


Harvest with Metadata (at Dissemination)

Metadata Magic: Get the resource together with its metadata

Harvest Pre-processed resource


Automatic Metadata via mod_oai

http://foo.edu/modoai/?verb=getRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg"
           metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding="base64"</mimeType>
        <data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>"jhove"</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data>
            Date: 2007-06-18 14:35:50 EDT
            RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
            ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
            LastModified: 2007-01-16 23:09:07 EST
            Size: 27750
            Format: JPEG
            Version: 1.00
            Status: Well-Formed and valid
            SignatureMatches: JPEG-hul
            MIMEtype: image/jpeg
            Profile: JFIF
            JPEGMetadata:
            CompressionType: Huffman coding, Baseline DCT
            Images:
            Number: 1
            Image:
            NisoImageMetadata:
            MIMEType: image/jpeg
            ByteOrder: big-endian
            CompressionScheme: JPEG
            ColorSpace: YCbCr
            SamplingFrequencyUnit: inch
            XSamplingFrequency: 33
            YSamplingFrequency: 26
            ImageWidth: 172
            ImageLength: 146
            BitsPerSample: 8, 8, 8
            SamplesPerPixel: 3
            Scans: 1
            QuantizationTables:
            QuantizationTable:
            Precision: 8-bit
            DestinationIdentifier: 0
            Comments: LEAD Technologies Inc. V1.01
            ApplicationSegments: APP0
          </data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
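A harvester receiving a response like the one above needs to pull the utility outputs back out. The Python sketch below parses an abbreviated stand-in for that response with the standard library. It assumes, as the slide's single `xmlns` declaration suggests, that the crate elements sit in the OAI-PMH default namespace; a real mod_oai deployment may differ. The names `SAMPLE` and `summarize` are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated stand-in for the crate response shown on this slide.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
      </header>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <data>JPEG image data, JFIF standard 1.00</data>
        </description>
        <description>
          <label>"jhove"</label>
          <data>Format: JPEG Status: Well-Formed and valid</data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def summarize(xml_text: str) -> dict:
    """Map each utility label in a crate response to its raw output."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", OAI_NS)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
    utilities = {}
    for desc in record.findall("oai:crateMetadata/oai:description", OAI_NS):
        # Labels on the slide carry literal quotation marks; strip them.
        label = desc.findtext("oai:label", default="", namespaces=OAI_NS).strip('" ')
        data = desc.findtext("oai:data", default="", namespaces=OAI_NS).strip()
        utilities[label] = data
    return {"identifier": identifier, "utilities": utilities}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```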


Preservation & Metadata

[Figure: Probability of Preservation (vertical axis, Low to High) against Resource Metadata Available (horizontal axis, Less to More). Labels: “HTTP/HTML”; “Automatic metadata utilities/CRATE”; “Archival Information Package (AIP)”.]


Automatic, Best-Effort Metadata

• Unverified
  – Utility results are not cross-checked
  – Output of analyses directly into XML response

• Undifferentiated
  – No categorization of output
  – Resource and metadata cohabit response

• Automatic
  – Generated at time of dissemination
  – Integrates preservation functions with the web server

A simple, easy-to-implement option for improving preservation metadata for web resources
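The "unverified" and "undifferentiated" properties can be made concrete with a short Python sketch: each utility's output is copied verbatim into a description element, and all descriptions land in one flat block. The element names follow the crate response shown earlier in the talk; the functions `describe` and `crate_metadata` are hypothetical and are not mod_oai's actual implementation.

```python
import xml.etree.ElementTree as ET


def describe(label: str, command: str, version: str, output: str) -> ET.Element:
    """Wrap one utility's raw output in a <description> element."""
    desc = ET.Element("description")
    ET.SubElement(desc, "label").text = label
    ET.SubElement(desc, "exec").text = command
    ET.SubElement(desc, "version").text = version
    # Unverified: the output is copied as-is, with no cross-checking.
    ET.SubElement(desc, "data").text = output
    return desc


def crate_metadata(results: list) -> str:
    """Collect all descriptions into one uncategorized <crateMetadata> block."""
    root = ET.Element("crateMetadata")
    for result in results:
        root.append(describe(*result))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(crate_metadata([
        ("file magic", "/usr/bin/file jackJill.jpg", "file-4.16", "JPEG image data"),
    ]))
```

Because nothing is interpreted or reconciled, adding another utility only means appending one more tuple, which is what keeps the approach cheap for a busy webmaster.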


Further Information

• The mod_oai project home page: http://www.modoai.org/

• IWAW 2007: “CRATE: A Simple Model for Self-Describing Web Resources”

• Authors’ web pages:
  – http://www.cs.odu.edu/~mln/pubs/
  – http://www.joanasmith.com/pubs.html

I Helped!