wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell...
DESCRIPTION
In the January 1994 issue of The Cornell Veterinarian editor Maurice E. White wrote: THIS is the last issue of "The Cornell Veterinarian". The "Cornell Vet" has a proud history, dating back to June, 1911... (p.1)
This presentation will describe Cornell University Library efforts to provide an "afterlife" to The Cornell Veterinarian by leveraging a number of disparate initiatives and metadata sources. While attempting to build article level linking to full-text in HathiTrust (functionality currently unavailable), limitations in the metadata captured during the scanning process were uncovered. The speaker will delineate these metadata findings and provide strategies (some scalable, others highly labor intensive) for gathering the necessary metadata for creating direct links to articles found in HathiTrust.
Presenter: Steven Folsom, Cornell University
Steven Folsom is a metadata librarian overseeing the creation and management of metadata for various Cornell University Library digital platforms. He strategizes on the integration of metadata across systems with the ultimate goal of improving discovery and access of information resources.
TRANSCRIPT
Wrangling Metadata from HathiTrust and PubMed: Providing Full-text Linking to The Cornell Veterinarian
Photo credit: http://www.walls.com/
Steven Folsom, NASIG Annual Conference 2014
Cornell Library Digital Consulting and Production Services
A single-point of service for those wishing to create digital collections
A virtual group that spans multiple departments within the Library (Digital Scholarship and Preservation Services, Cornell Library IT and Metadata Librarians from Library Technical Services)
Approaches digital collection building holistically, and addresses the entire life cycle management of a project
The Cornell Veterinarian Project Participants
Client:
Cornell Flower-Sprecher Veterinary Library
DCAPS Involvement:
Jaron Porciello, Digital Scholarship Initiatives Coordinator
Michelle Paolillo, Project Manager/Business Analyst (CUL’s HathiTrust Liaison)
John Cline, Cornell Library Programmer
Steven Folsom, Metadata Librarian
HathiTrust Digital Library
Digital Library consisting of the Google Books project, Internet Archive digitization initiatives, and content digitized locally by libraries
Committed to preserving content with stable access and distributed/coordinated cost of storage
Centralized technical framework that allows for the creation of tools and services
The Cornell Veterinarian
The Challenge
Hathi Volume Interface
Google Books: Contributions from Cornell Library
Participation in the Google Books Library Project since 2008
Google focuses on materials that they have not already digitized
Using OCLC holdings information, they compose a Cornell candidate list
HathiTrust Data API
Hathi METS File
METS File Continued
Hathifiles
Tab-delimited full files of the Hathi Digital Library and incremental updates (Full file is currently over 2.5 GB uncompressed)
Light Bibliographic data
Includes some administrative metadata, e.g. rights information, the originating institution for the scanned copy
Select Hathifile Record Elements
Hathi Volume ID: mdp.39015076694507
Access: allow [Notes on mapping for rights attributes where contextual user data would affect access]
Rights: pd [public domain]
HathiTrust record number: 000529434
Enumeration/Chronology: v.33 no.11 1900
Source: MIU
Title: The Chicago medical times
OCLC number: 1554176
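A minimal sketch of reading Hathifile rows such as the record above, in Python. The column names and their order are an assumption based on the leading fields shown on this slide, and should be checked against the current Hathifiles documentation; `read_hathifile` and `volumes_for_record` are hypothetical helper names, and the `SOURCE-BIB-NUM` value in the sample row is a placeholder.

```python
import csv
import io

# ASSUMPTION: names and order of the leading Hathifile columns.
# Any trailing columns in a row are ignored by this sketch.
ASSUMED_COLUMNS = [
    "htid", "access", "rights", "ht_bib_key", "description",
    "source", "source_bib_num", "oclc_num",
]


def read_hathifile(handle, columns=ASSUMED_COLUMNS):
    """Yield one dict per tab-delimited row, keyed by the assumed columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(columns, row))


def volumes_for_record(handle, ht_bib_key):
    """Return every row whose HathiTrust record number matches ht_bib_key."""
    return [r for r in read_hathifile(handle) if r.get("ht_bib_key") == ht_bib_key]


# One row built from the slide's example values; SOURCE-BIB-NUM is a placeholder.
sample = (
    "mdp.39015076694507\tallow\tpd\t000529434\tv.33 no.11 1900"
    "\tMIU\tSOURCE-BIB-NUM\t1554176\n"
)

for row in volumes_for_record(io.StringIO(sample), "000529434"):
    print(row["htid"], row["description"])
```

For the full file, the same functions can be pointed at an open file handle instead of `io.StringIO`; because rows are yielded one at a time, the 2.5 GB file does not need to fit in memory.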
HathiTrust Bibliographic API
Meant for use to retrieve information about small numbers of items at a time
Returns bibliographic, rights, and volume information when given a single or multiple standard identifiers (ISBN, LCCN, OCLC, etc.), includes overlap with the Hathifile data
Brief example: http://catalog.hathitrust.org/api/volumes/brief/oclc/424023.json
Full example: http://catalog.hathitrust.org/api/volumes/full/oclc/424023.json
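A rough sketch of calling the Bibliographic API from Python, following the URL pattern of the two examples above. The helper names are hypothetical, and the response field names (`items`, `htid`, `enumcron`) are an assumption about the JSON layout that should be verified against a live response.

```python
import json
from urllib.request import urlopen

BASE = "http://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type, value, detail="brief"):
    """Build a request URL such as .../brief/oclc/424023.json."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BASE}/{detail}/{id_type}/{value}.json"


def item_ids(response):
    """Return (volume id, enumeration/chronology) pairs from a parsed response.

    ASSUMPTION: volumes are listed under "items" with "htid" and "enumcron" keys.
    """
    return [(i.get("htid"), i.get("enumcron")) for i in response.get("items", [])]


def fetch(id_type, value, detail="brief"):
    """Fetch and parse one response (performs a network request)."""
    with urlopen(bib_api_url(id_type, value, detail)) as resp:
        return json.load(resp)


print(bib_api_url("oclc", "424023"))
```

Calling `item_ids(fetch("oclc", "424023"))` would list the volumes attached to that record, which is how a per-volume list for a serial could be gathered without downloading the full Hathifile.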
Hathi Metadata Recap
Data API
• Administrative data about scans and corresponding volumes
• Uses Hathi IDs to link to bibliographic data
HathiFiles
• Bulk Bibliographic data
• Some administrative data, e.g. Rights information
BIB API
• Small requests for Bibliographic data retrieved using standard identifiers (ISBN, LCCN, OCLC…)
What we thought was the solution….
Use Hathi Data API to find Table of Contents for each Volume
Gather the related OCR
Parse out article citation values from the OCR (Hopefully in a mostly automated way)
Use the pagination data from TOC to build links by mapping to pagination in the METS files.
What couldn’t be automated would be done manually
(with the projected outcome being a citation index with Hathi URLs that could be used to build an interface or given to an index like PubMed)
Reality set in…
Photo credit: ehive.com
HathiTrust OCR
The metadata continued to fight back…
Photo credit: http://glpiggy.net/
PubMed Indexing and API
A Path for Automation
For each citation already in PubMed for which HathiTrust has one volume:
1. Search the Hathi File for the PubMed <Volume> AND the Hathi Catalog id (000535347) for The Cornell Veterinarian to get the corresponding Hathi object id used in the METS
2. Use the METS object id AND the PubMed start page (the numeric value before the "-" in each PubMed article citation) to find the matching <ORDERLABEL> and get the <Order> number from the METS file
3. Create the URL to be added to the PubMed XML. The Hathi METS object id and <Order> number are used to create the URL: the METS id equals the id in the URL, and the <Order> number equals the sequence number (seq), e.g. http://babel.hathitrust.org/cgi/pt?id=coo.31924051143075;view=1up;seq=11
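Steps 2 and 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual script: it assumes ORDER and ORDERLABEL are attributes on `METS:div` elements in the METS structMap, the sample METS fragment is invented for the example, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"

# Invented fragment for illustration only; a real METS file is much larger.
SAMPLE_METS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap TYPE="physical">
    <METS:div TYPE="volume">
      <METS:div ORDER="10" TYPE="page"/>
      <METS:div ORDER="11" ORDERLABEL="1" TYPE="page"/>
      <METS:div ORDER="12" ORDERLABEL="2" TYPE="page"/>
    </METS:div>
  </METS:structMap>
</METS:mets>"""


def start_page(pubmed_pagination):
    """Return the value before the '-' in a PubMed pagination string."""
    return pubmed_pagination.split("-", 1)[0].strip()


def find_order(mets_xml, page_label):
    """Return ORDER of the first div whose ORDERLABEL equals page_label."""
    root = ET.fromstring(mets_xml)
    for div in root.iter(METS_NS + "div"):
        if div.get("ORDERLABEL") == page_label:
            return int(div.get("ORDER"))
    return None  # unlabeled or missing page: needs manual review


def page_url(htid, seq):
    """Build the page-level URL in the pattern shown in step 3."""
    return f"http://babel.hathitrust.org/cgi/pt?id={htid};view=1up;seq={seq}"


seq = find_order(SAMPLE_METS, start_page("1-12"))
print(page_url("coo.31924051143075", seq))
```

Only the first matching label is used here. Where a volume repeats page labels (for example, when several issues are bound together and each restarts at page 1), the match would need to be narrowed, or the citation handled manually.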
NCBI’s LinkOut Program
A service that allows third parties to link specific NCBI database records to relevant web-accessible resources
The relevant journal/publication must already have gone through the Medline selection process
Document Type Definition (DTD) for contributing links in XML
PubMed Citation Data Requirements
PubMed DTD specifies how the data should be formatted
Data Tags (R = Required, O = Optional, O/R = Optional or Required): Required tags must be included; Optional tags must be included only if the data requested appears in the print or electronic article; Optional or Required tags are dependent on the use of other tags
Tag names are case sensitive
PubMed Citation Data Elements
File Header (R), ArticleSet (R), Article (R), Journal (R), PublisherName (R), JournalTitle (R), Issn (R), Volume (O/R), Issue (O/R), PubDate (R), Year (R), Month (O/R), Season (O),
Day (O), Replaces (O), ArticleTitle (O), VernacularTitle (O), FirstPage (O/R), LastPage (O), ELocationID (O/R), Language (O), AuthorList (O/R), Author (R), FirstName (O/R), MiddleName (O), LastName (O/R), Suffix (O), CollectiveName (O),
Affiliation (O), Identifier (O), GroupList (O/R), Group (R), GroupName (R), IndividualName (O), PublicationType (O), ArticleIdList (O/R), ArticleId (R), History (O), Abstract (O), OtherAbstract (O), CopyrightInformation (O), ObjectList (O), Object (O), Param (O)
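To make the element list concrete, here is a small Python sketch that builds a citation record from a subset of the tags above. The nesting (Journal and PubDate as containers inside Article) is an assumption to be checked against the PubMed DTD, the sketch does not attempt to produce a complete valid submission, and every value in `SAMPLE` is a placeholder rather than real citation data.

```python
import xml.etree.ElementTree as ET

# Tags marked (R) on the slide that this sketch insists on.
REQUIRED = ("PublisherName", "JournalTitle", "Issn", "Year")
JOURNAL_TAGS = ("PublisherName", "JournalTitle", "Issn", "Volume", "Issue")
ARTICLE_TAGS = ("ArticleTitle", "FirstPage", "LastPage")


def build_article_set(citations):
    """Build an ArticleSet element from a list of citation dicts.

    ASSUMPTION: Journal holds the journal-level tags and PubDate/Year,
    while ArticleTitle, FirstPage and LastPage sit directly under Article.
    """
    article_set = ET.Element("ArticleSet")
    for citation in citations:
        missing = [t for t in REQUIRED if not citation.get(t)]
        if missing:
            raise ValueError("missing required tags: " + ", ".join(missing))
        article = ET.SubElement(article_set, "Article")
        journal = ET.SubElement(article, "Journal")
        for tag in JOURNAL_TAGS:
            if citation.get(tag):
                ET.SubElement(journal, tag).text = citation[tag]
        pub_date = ET.SubElement(journal, "PubDate")
        ET.SubElement(pub_date, "Year").text = citation["Year"]
        for tag in ARTICLE_TAGS:
            if citation.get(tag):
                ET.SubElement(article, tag).text = citation[tag]
    return article_set


# Placeholder values only, not real citation data.
SAMPLE = {
    "PublisherName": "Example Publisher",
    "JournalTitle": "The Cornell Veterinarian",
    "Issn": "0000-0000",
    "Volume": "1",
    "Issue": "1",
    "Year": "1994",
    "ArticleTitle": "Example article title",
    "FirstPage": "1",
}

print(ET.tostring(build_article_set([SAMPLE]), encoding="unicode"))
```

Tag names are written exactly as on the slide because they are case sensitive; optional tags are emitted only when the citation supplies a value.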
In an Ideal World…
Photo credit: http://www.priefert.com/
The metadata that got away…
Pre-1945 issues not indexed by PubMed
Supplemental volumes*
What we hope to do about it:
Manually capture the Hathi URLs for the supplemental volumes and provide them to PubMed using their linking format
Manually capture citation data for pre-1945 articles using the OCR files, and send to PubMed using their indexing format.
Project Outcomes
Soft:
Better understanding of what’s possible with Hathi APIs
Better understanding of PubMed’s metadata/URL contribution requirements
Increased desire within the Cornell Library to consider greater return on our HathiTrust investment
Concrete:
The Cornell Veterinarian should soon be available via PubMed for the years already indexed
Manually capturing the complete backfile for The Cornell Veterinarian to contribute to PubMed
Future Considerations
Potential for improved access to other titles currently lacking full-text linking in PubMed [if in HathiTrust]
Investigations into other (non)full-text indexes and fulltext repositories
New Services for interacting with HathiTrust Digital Library
Potential improvements to the Hathi workflows.
Questions?
Photo credit: ehive.com