Bitter Harvest
Metadata Harvesting Issues, Problems, and Possible Solutions
Roy Tennant
California Digital Library
Outline
Brief Harvesting Overview
Harvesting Problems
Steps to a Fruitful Harvest
A Harvesting Service Model
Indexing and Interfaces
What’s Next?
Open Archives Initiative
Open Archives Initiative: “develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content”
Huh? Let’s just say it’s an effort to help people find stuff
Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest
Well over 500 repositories world-wide support the protocol
OAIster.org has indexed 3.5 million items from those repositories
OAI-PMH
Data providers (DP): those with the stuff
Service providers (SP): those who harvest metadata and provide aggregation and search services
OAI-PMH verbs:
Identify
ListIdentifiers
ListMetadataFormats
ListSets
ListRecords
GetRecord
Software for both DPs and SPs readily available
www.oaforum.org/tutorial/
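Each OAI-PMH verb is sent to a repository as an ordinary HTTP request with a `verb` parameter. The sketch below shows how a service provider might build those request URLs. The repository address `http://example.org/oai`, the set name `maps`, and the helper name `build_request` are illustrative assumptions, not part of the talk.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH, as listed on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}

def build_request(base_url, verb, **arguments):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"

# Hypothetical repository address, for illustration only.
BASE_URL = "http://example.org/oai"

print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc", set="maps"))
```

`oai_dc` is the metadata prefix for simple Dublin Core, the one format every data provider must support.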
OAI Architecture
Source: Open Archives Forum Tutorial
gita.grainger.uiuc.edu/registry/
errol.oclc.org
Harvesting Problems
Sets
Metadata Formats
Metadata Artifacts
Granularity
Metadata Variances
Sets
Records are harvested in clumps called “sets,” created by DPs
No guidelines exist for defining sets
Examples:
Collection
Organizational structure
Format (but is a page image an image? See example)
Metadata Formats
Only required format is simple Dublin Core, although any format can be made available in addition
Few DPs surface richer metadata
Simple DC is simply too simple!
Example (artifact vs. surrogate dates)
Metadata Artifacts
“unintended, unwanted aberrations”
Sample causes:
Idiosyncratic local practices
Anachronisms
HTML code
Examples:
Circa = string of dates for searching purposes
[electronic resource]
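Two of the artifacts above, stray HTML code and the “[electronic resource]” qualifier, can often be stripped mechanically. The following is a minimal sketch, assuming metadata values arrive as plain strings. The function name `clean_value` and the exact patterns are illustrative choices, and real data will likely need more rules.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
QUALIFIER_RE = re.compile(r"\s*\[electronic resource\]", re.IGNORECASE)

def clean_value(value):
    """Remove common harvesting artifacts from one metadata value."""
    value = TAG_RE.sub("", value)         # drop HTML tags
    value = html.unescape(value)          # decode entities such as &amp;
    value = QUALIFIER_RE.sub("", value)   # drop the "[electronic resource]" qualifier
    return " ".join(value.split())        # collapse runs of whitespace

print(clean_value("<i>Maps &amp; Atlases</i> [electronic resource]"))
```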
Granularity
Record Granularity: what is an “object”?
A book, or each individual page?
Examples: CDL, Univ. of Michigan
Metadata Granularity: multiple values in one field
Example: Univ. of Washington
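When several values are packed into one field, a service provider may want to split them before indexing. This sketch assumes the values are separated by semicolons. That delimiter is a guess for illustration, and each data provider's practice has to be checked first.

```python
def split_values(values, delimiter=";"):
    """Split packed field values into one entry per value.

    The delimiter is an assumption; inspect the harvested data first.
    """
    result = []
    for value in values:
        for part in value.split(delimiter):
            part = part.strip()
            if part:                 # skip empty pieces left by trailing delimiters
                result.append(part)
    return result

print(split_values(["Fishing; Boats ;", "Salmon"]))
```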
Metadata Variances
Subject terminology differences
Disparities in recording the same metadata
Example: date variances
Mapping oddities or mistakes
Examples: 1) format into description, 2) description into subject
Steps to a Fruitful Harvest
Needs Assessment (it’s the user, stupid)
DP Identification and Communication
Metadata Capture
Metadata Analysis
Metadata Subsetting
Metadata Normalization
Metadata Enrichment
Indexing
Interface (it’s still the user, stupid)
Needs Assessment
What are you trying to accomplish?
What will your users want to be able to do?
What metadata will you need, and what procedures will you need to set up to enable these activities?
Which repositories have what you want?
Is what they have (e.g., sets, metadata) usable as is, or ?
DP Identification & Communication
Identification:
Use UIUC directory of DPs to identify potential sources
Communication:
Not required to tell them you are harvesting, but may help establish a good relationship
May want to request that they surface a richer metadata format and/or provide a different set
Metadata Capture
Sample questions to answer:
Individual sets, or all?
Richer metadata formats available?
How frequently to reharvest?
Start from scratch each time or update?
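Several of these questions map directly onto `ListRecords` arguments: `set` selects an individual set, `metadataPrefix` selects the format, and `from` limits a reharvest to records changed since a given date. The sketch below is one possible capture loop, not any particular harvester's implementation. The function names are hypothetical, error responses are not handled, and the `fetch` parameter exists so the loop can be tried without a live repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def fetch(url):
    """Default fetcher: return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()

def harvest(base_url, metadata_prefix="oai_dc", set_spec=None,
            from_date=None, fetch=fetch):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # one set rather than everything
    if from_date:
        params["from"] = from_date      # update rather than start from scratch
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        yield from root.iter(f"{OAI_NS}record")
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break                       # no token, or an empty one: list is complete
        # A resumption token replaces all other arguments except the verb.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A call such as `harvest("http://example.org/oai", set_spec="maps", from_date="2004-01-01")` would then request only the changed records in one set; the address and set name here are placeholders.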
Many software options
+-----------------------------------------+
| Harvester Sample Configurator           |
+-----------------------------------------+
| Version 1.1 :: July 2002                |
| Hussein Suleman <[email protected]>      |
| Digital Library Research Laboratory     |
| www.dlib.vt.edu :: Virginia Tech        |
+-----------------------------------------+

Defaults/previous values are in brackets - press <enter> to accept those
enter "&delete" to erase a default value
enter "&continue" to skip further questions and use all defaults
press <ctrl>-c to escape at any time (new values will be lost)

Press <enter> to continue

[ARCHIVES]
Add all the archives that should be harvested

Current list of archives:
No archives currently defined !

Select from: [A]dd [D]one
Enter your choice [D] : a{return}

[ARCHIVE IDENTIFIER]
You need a unique name by which to refer to the archive you
will harvest metadata from
Examples: nsdl-380602, VTETD

Archive identifier [] : nsdl-380602{return}
Virginia Tech Perl Harvester
Metadata Analysis
Finding out what you have (and don’t have)
Encoding practices
Gap analysis (e.g., missing fields)
Mistakes (e.g., mapping errors)
Software can help
Commercial software like Spotfire
In-house or open source software tools
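An in-house analysis tool can start very small. The sketch below counts how many records use each of the fifteen simple Dublin Core elements and how many records lack each one, which is one form of gap analysis. It assumes each harvested record has already been parsed into a dictionary mapping element names to lists of values. That representation and the function names are illustrative assumptions.

```python
from collections import Counter

# The fifteen elements of simple Dublin Core.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

def element_usage(records):
    """Count the records that carry at least one non-empty value per element."""
    usage = Counter()
    for record in records:
        for element in DC_ELEMENTS:
            if any(value.strip() for value in record.get(element, [])):
                usage[element] += 1
    return usage

def gap_report(records):
    """Return, per element, the number of records missing that element."""
    usage = element_usage(records)
    return {element: len(records) - usage[element] for element in DC_ELEMENTS}

sample = [
    {"title": ["A"], "date": ["1900"]},
    {"title": ["B"], "date": [""]},    # an empty value counts as missing
]
print(gap_report(sample)["date"])
```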
Source: 2002 Master’s Thesis, Jewel Hope Ward, UNC Chapel Hill
Five elements are used 71% of the time
Metadata Analysis Model
Metadata Subsetting
DP sets are unlikely to serve all SP uses well
SPs will need the ability to subset harvested metadata
Example: prototype subsetting tool
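A subset can be defined by the service provider independently of the data provider's sets. This is a minimal sketch of that idea and not the prototype tool mentioned above. It assumes records are dictionaries mapping field names to lists of values and selects those where a chosen field contains a keyword.

```python
def subset(records, field, keyword):
    """Return the records whose given field contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        record for record in records
        if any(keyword in value.lower() for value in record.get(field, []))
    ]

harvested = [
    {"title": ["Harbor view"], "type": ["Image"]},
    {"title": ["Annual report"], "type": ["Text"]},
    {"title": ["Untitled"]},                      # no type at all
]
print(subset(harvested, "type", "image"))
```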
A Subsetting Model
Metadata Normalization
Normalizing: to reduce to a standard or normal state
Prototype date normalization service screen
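Dates are a common normalization target because they are recorded in so many ways. The sketch below is one simple approach, not the prototype service shown on the slide: it pulls every four-digit year out of a date string and reduces the value to a (start year, end year) pair. The year range accepted by the pattern (1000 to 2099) and the function name are assumptions.

```python
import re

# Four-digit years from 1000 to 2099 that are not part of a longer number.
YEAR_RE = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")

def normalize_date(value):
    """Reduce a free-text date to (start_year, end_year), or None if no year is found."""
    years = [int(year) for year in YEAR_RE.findall(value)]
    if not years:
        return None
    return (min(years), max(years))

for raw in ["ca. 1905", "1898-1902", "[1923?]", "1900 1901 1902 1903", "n.d."]:
    print(raw, "->", normalize_date(raw))
```

Reducing to a year range loses month and day detail, so the original value should be kept for display.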
Metadata Enrichment
Adding fields or values may be useful or required, for example:
Metadata provider information
Geographic coverage
Subject terms mapped to a different thesaurus
Authority control record
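The first two kinds of enrichment can be sketched as a function that returns an enriched copy of a record. It assumes records are dictionaries mapping field names to lists of values. The `provider` field name and the sample values are hypothetical, and thesaurus mapping and authority control are left out because they depend on external vocabularies.

```python
import copy

def enrich(record, provider, coverage=None):
    """Return a copy of the record with provider and, if absent, coverage added."""
    enriched = copy.deepcopy(record)          # leave the harvested record untouched
    enriched.setdefault("provider", []).append(provider)
    if coverage and not enriched.get("coverage"):
        enriched["coverage"] = [coverage]     # only fill in when nothing was supplied
    return enriched

original = {"title": ["Harbor view"]}
print(enrich(original, "Example University Library", coverage="California"))
```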
A Harvesting Service Model
Indexing
Pick your favorite database/indexing software:
MySQL
SWISH-E
May need to specifically set up a method to search across the entire record
May need different fields for indexing than for display
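Both points can be illustrated with one table that stores a display field alongside a combined search field built from the entire record. The sketch uses Python's built-in `sqlite3` module as a stand-in for MySQL or SWISH-E so that it runs without a server. The table layout and function names are assumptions, and the search term is not escaped for `LIKE` wildcards.

```python
import sqlite3

def build_index(records):
    """Load records into an in-memory table with separate display and search fields."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE records (id TEXT PRIMARY KEY, display_title TEXT, all_text TEXT)"
    )
    for identifier, record in records.items():
        display_title = "; ".join(record.get("title", []))
        # The search field concatenates every value in the record, lowercased.
        all_text = " ".join(v for values in record.values() for v in values).lower()
        db.execute(
            "INSERT INTO records VALUES (?, ?, ?)",
            (identifier, display_title, all_text),
        )
    db.commit()
    return db

def search(db, term):
    """Return (id, display_title) for records whose full text contains the term."""
    rows = db.execute(
        "SELECT id, display_title FROM records WHERE all_text LIKE ? ORDER BY id",
        (f"%{term.lower()}%",),
    )
    return rows.fetchall()

catalog = {
    "r1": {"title": ["Salmon fishing"], "subject": ["Boats"]},
    "r2": {"title": ["Maps"], "creator": ["Smith"]},
}
index = build_index(catalog)
print(search(index, "boats"))
```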
Interface
Software interface (API) for other applications:
SRU/SRW?
Arbitrary Web Services schema?
User interface
What’s Next?
Further protocol development
Services layered on top of OAI-PMH
Shared software tools
Best practices for both DPs and SPs
oai-best.comm.nsdl.org