Wrangling Metadata from HathiTrust and PubMed: Providing Full-text Linking to The Cornell Veterinarian

Steven Folsom, NASIG Annual Conference 2014

Photo credit: http://www.walls.com/

Upload: nasig

Post on 11-May-2015

DESCRIPTION

In the January 1994 issue of The Cornell Veterinarian editor Maurice E. White wrote: THIS is the last issue of "The Cornell Veterinarian". The "Cornell Vet" has a proud history, dating back to June, 1911... (p.1)

This presentation will describe Cornell University Library efforts to provide an "afterlife" to The Cornell Veterinarian by leveraging a number of disparate initiatives and metadata sources. While attempting to build article level linking to full-text in HathiTrust (functionality currently unavailable), limitations in the metadata captured during the scanning process were uncovered. The speaker will delineate these metadata findings and provide strategies (some scalable, others highly labor intensive) for gathering the necessary metadata for creating direct links to articles found in HathiTrust.

Presenter: Steven Folsom, Cornell University

Steven Folsom is a metadata librarian overseeing the creation and management of metadata for various Cornell University Library digital platforms. He strategizes on the integration of metadata across systems with the ultimate goal of improving discovery and access of information resources.

TRANSCRIPT

Page 1: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Wrangling Metadata from HathiTrust and PubMed: Providing Full-text Linking to The Cornell Veterinarian

Steven Folsom, NASIG Annual Conference 2014

Photo credit: http://www.walls.com/

Page 2: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Cornell Library Digital Consulting and Production Services

A single point of service for those wishing to create digital collections

A virtual group that spans multiple departments within the Library (Digital Scholarship and Preservation Services, Cornell Library IT and Metadata Librarians from Library Technical Services)

Approaches digital collection building holistically, and addresses the entire life cycle management of a project

Page 3: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

The Cornell Veterinarian Project Participants

Client:

Cornell Flower-Sprecher Veterinary Library

DCAPS Involvement:

Jaron Porciello, Digital Scholarship Initiatives Coordinator

Michelle Paolillo, Project Manager/Business Analyst (CUL’s HathiTrust Liaison)

John Cline, Cornell Library Programmer

Steven Folsom, Metadata Librarian

Page 4: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

HathiTrust Digital Library

Digital Library consisting of the Google Books project, Internet Archive digitization initiatives, and content digitized locally by libraries

Committed to preserving content with stable access and distributed/coordinated cost of storage

Centralized technical framework that allows for the creation of tools and services

Page 5: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

The Cornell Veterinarian

Page 6: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

The Challenge

Page 7: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Hathi Volume Interface

Page 8: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Google Books: Contributions from Cornell Library

Participation in the Google Books Library Project since 2008

Google focuses on materials that they have not already digitized

Using OCLC holdings information, they compose a Cornell candidate list

Page 9: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

HathiTrust Data API

Page 10: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Hathi METS File

Page 11: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

METS File Continued

Page 12: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Hathifiles

Tab-delimited full files of the Hathi Digital Library and incremental updates (Full file is currently over 2.5 GB uncompressed)

Light Bibliographic data

Includes some administrative metadata, e.g. rights information, the originating institution for the scanned copy

Page 13: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Select Hathifile Record Elements

Hathi Volume ID: mdp.39015076694507

Access: allow [Notes on mapping for rights attributes where contextual user data would affect access]

Rights: pd [public domain]

HathiTrust record number: 000529434

Enumeration/Chronology: v.33 no.11 1900

Source: MIU

Title: The Chicago medical times

OCLC number: 1554176
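
The record elements above can be pulled out of a Hathifile with a few lines of Python. This is a minimal sketch, not the project's actual code: the 0-based column positions in `ASSUMED_COLUMNS` are an assumption and should be checked against the published Hathifiles field list, and the function names and `SAMPLE_LINE` are hypothetical, built only from the values shown on this slide.

```python
import csv

# ASSUMPTION: 0-based positions of the slide's fields within one
# tab-delimited Hathifile row. Verify against the Hathifiles field list.
ASSUMED_COLUMNS = {
    "volume_id": 0,
    "access": 1,
    "rights": 2,
    "record_number": 3,
    "enum_chron": 4,
    "source": 5,
    "oclc_number": 7,
    "title": 11,
}


def parse_hathifile(lines, columns=ASSUMED_COLUMNS):
    """Yield one dict per tab-delimited row, keeping only the named columns."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            name: (row[index] if index < len(row) else "")
            for name, index in columns.items()
        }


def volumes_for_record(lines, record_number, columns=ASSUMED_COLUMNS):
    """Return every row whose HathiTrust record number matches."""
    return [
        row
        for row in parse_hathifile(lines, columns)
        if row["record_number"] == record_number
    ]


# Hypothetical sample row assembled from the values on the slide;
# unused columns are left empty.
SAMPLE_LINE = "\t".join([
    "mdp.39015076694507", "allow", "pd", "000529434",
    "v.33 no.11 1900", "MIU", "", "1554176", "", "", "",
    "The Chicago medical times",
])

if __name__ == "__main__":
    for row in volumes_for_record([SAMPLE_LINE], "000529434"):
        print(row["volume_id"], row["enum_chron"])
```

Because the full file is over 2.5 GB uncompressed, the sketch streams rows from any iterable of lines (for example an open file object) rather than loading the file into memory.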

Page 14: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

HathiTrust Bibliographic API

Meant for use to retrieve information about small numbers of items at a time

Returns bibliographic, rights, and volume information when given a single or multiple standard identifiers (ISBN, LCCN, OCLC, etc.), includes overlap with the Hathifile data

Brief example: http://catalog.hathitrust.org/api/volumes/brief/oclc/424023.json

Full example: http://catalog.hathitrust.org/api/volumes/full/oclc/424023.json
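
The two example URLs follow one pattern, which can be sketched in Python as below. The URL layout is taken from the examples on this slide; the function names are hypothetical, and the response shape read by `volume_ids` (an "items" list whose entries carry "htid" and "enumcron") is an assumption to confirm against a real response.

```python
BIB_API_BASE = "http://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type, id_value, level="brief"):
    """Build a Bibliographic API URL such as .../brief/oclc/424023.json."""
    if level not in ("brief", "full"):
        raise ValueError("level must be 'brief' or 'full'")
    return f"{BIB_API_BASE}/{level}/{id_type}/{id_value}.json"


def volume_ids(response):
    """Collect (htid, enumcron) pairs from a parsed JSON response.

    ASSUMPTION: the response is a dict with an "items" list whose
    entries have "htid" and "enumcron" keys.
    """
    return [
        (item.get("htid"), item.get("enumcron"))
        for item in response.get("items", [])
    ]


if __name__ == "__main__":
    print(bib_api_url("oclc", "424023"))
    print(bib_api_url("oclc", "424023", level="full"))
```

The sketch only builds the URL and reads an already-parsed response; fetching is left to whatever HTTP client is at hand, since the API is meant for small numbers of items at a time.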

Page 15: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Hathi Metadata Recap

Data API

• Administrative data about scans and corresponding volumes

• Uses Hathi IDs to link to bibliographic data

HathiFiles

• Bulk Bibliographic data

• Some administrative data, e.g. Rights information

BIB API

• Small requests for Bibliographic data retrieved using standard identifiers (ISBN, LCCN, OCLC…)

Page 16: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

What we thought was the solution….

Use Hathi Data API to find Table of Contents for each Volume

Gather the related OCR

Parse out article citation values from the OCR (Hopefully in a mostly automated way)

Use the pagination data from TOC to build links by mapping to pagination in the METS files.

What couldn’t be automated would be done manually

(with the projected outcome being a citation index with Hathi URLs that could be used to build an interface or given to an index like PubMed)

Page 17: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Reality set in…

Photo credit: ehive.com

Page 18: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

HathiTrust OCR

Page 19: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

The metadata continued to fight back…

Photo credit: http://glpiggy.net/

Page 20: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

PubMed Indexing and API

Page 21: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

A Path for Automation

For each citation already in PubMed for which HathiTrust has one volume:

1. Search the Hathifile using the PubMed <Volume> value AND the Hathi catalog id (000535347) for The Cornell Veterinarian to get the corresponding Hathi object id, the same id used in the METS

2. Use the METS object id AND the PubMed start page (the numeric value before the "-" in each PubMed article citation's pagination) to find the matching <ORDERLABEL>, and from it the <Order> number, in the METS file

3. Create the URL to be added to the PubMed XML from the Hathi METS object id and the <Order> number. The id in the URL equals the METS object id, and the sequence number (seq) equals the <Order> number, e.g. http://babel.hathitrust.org/cgi/pt?id=coo.31924051143075;view=1up;seq=11
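
Steps 2 and 3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's script: it assumes page entries in the METS carry `ORDER` and `ORDERLABEL` attributes, `SAMPLE_METS` is a hypothetical fragment and not a real HathiTrust file, and all function names are invented for the example. The URL pattern is the one shown on the slide.

```python
import xml.etree.ElementTree as ET


def start_page(pagination):
    """Return the value before the first '-' in a PubMed pagination string."""
    return pagination.split("-", 1)[0].strip()


def order_by_label(mets_xml):
    """Map each printed page label (ORDERLABEL) to its scan sequence (ORDER).

    ASSUMPTION: page elements expose ORDER and ORDERLABEL attributes.
    If a label repeats, the first occurrence wins.
    """
    mapping = {}
    for element in ET.fromstring(mets_xml).iter():
        label = element.get("ORDERLABEL")
        order = element.get("ORDER")
        if label and order:
            mapping.setdefault(label, int(order))
    return mapping


def hathi_page_url(object_id, order):
    """Build the page-level URL: object id plus sequence number."""
    return f"http://babel.hathitrust.org/cgi/pt?id={object_id};view=1up;seq={order}"


def link_for_citation(object_id, pagination, mets_xml):
    """Return the Hathi URL for a citation's start page, or None if unmatched."""
    order = order_by_label(mets_xml).get(start_page(pagination))
    return None if order is None else hathi_page_url(object_id, order)


# Hypothetical METS fragment for illustration only.
SAMPLE_METS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap TYPE="physical">
    <METS:div TYPE="volume">
      <METS:div TYPE="page" ORDER="10"/>
      <METS:div TYPE="page" ORDER="11" ORDERLABEL="1"/>
      <METS:div TYPE="page" ORDER="12" ORDERLABEL="2"/>
    </METS:div>
  </METS:structMap>
</METS:mets>"""

if __name__ == "__main__":
    print(link_for_citation("coo.31924051143075", "1-9", SAMPLE_METS))
```

Returning None for an unmatched start page keeps the unautomatable cases visible, so they can be set aside for manual handling.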

Page 22: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

NCBI’s LinkOut Program

A service that allows third parties to link specific NCBI database records to relevant web-accessible resources

The relevant journal/publication must already have gone through the Medline selection process

Document Type Definition (DTD) for contributing links in XML

Page 23: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

PubMed Citation Data Requirements

PubMed DTD specifies how the data should be formatted

Data Tags (R = Required, O = Optional, O/R = Optional or Required). Required tags must be included; optional tags must be included only if the data requested appears in the print or electronic article. Optional or Required tags are dependent on the use of other tags

Tag names are case sensitive

Page 24: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

PubMed Citation Data Elements

File Header (R), ArticleSet (R), Article (R), Journal (R), PublisherName (R), JournalTitle (R), Issn (R), Volume (O/R), Issue (O/R), PubDate (R), Year (R), Month (O/R), Season (O), Day (O), Replaces (O), ArticleTitle (O), VernacularTitle (O), FirstPage (O/R), LastPage (O), ELocationID (O/R), Language (O), AuthorList (O/R), Author (R), FirstName (O/R), MiddleName (O), LastName (O/R), Suffix (O), CollectiveName (O), Affiliation (O), Identifier (O), GroupList (O/R), Group (R), GroupName (R), IndividualName (O), PublicationType (O), ArticleIdList (O/R), ArticleId (R), History (O), Abstract (O), OtherAbstract (O), CopyrightInformation (O), ObjectList (O), Object (O), Param (O)
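
A citation record using a handful of these tags might be assembled as in the Python sketch below. The tag names come from the list above; the nesting (Journal holding the publisher, title, ISSN, volume, issue and PubDate, with the article-level tags beside it) is an assumption to verify against the PubMed DTD, and the function names and sample values are hypothetical placeholders, not real citation data.

```python
import xml.etree.ElementTree as ET


def build_article(publisher, journal_title, issn, volume, issue,
                  year, article_title, first_page, last_page=None):
    """Build one <Article> element. Tag names are case sensitive.

    ASSUMPTION: the element nesting below mirrors the PubMed DTD.
    """
    article = ET.Element("Article")
    journal = ET.SubElement(article, "Journal")
    ET.SubElement(journal, "PublisherName").text = publisher
    ET.SubElement(journal, "JournalTitle").text = journal_title
    ET.SubElement(journal, "Issn").text = issn
    ET.SubElement(journal, "Volume").text = str(volume)
    ET.SubElement(journal, "Issue").text = str(issue)
    pub_date = ET.SubElement(journal, "PubDate")
    ET.SubElement(pub_date, "Year").text = str(year)
    ET.SubElement(article, "ArticleTitle").text = article_title
    ET.SubElement(article, "FirstPage").text = str(first_page)
    if last_page is not None:
        # LastPage is optional, so emit it only when known.
        ET.SubElement(article, "LastPage").text = str(last_page)
    return article


def build_article_set(articles):
    """Wrap <Article> elements in the required <ArticleSet> root."""
    root = ET.Element("ArticleSet")
    root.extend(articles)
    return root


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = build_article(
        publisher="Example Publisher",
        journal_title="The Cornell Veterinarian",
        issn="0000-0000",
        volume=1, issue=1, year=1911,
        article_title="Example article title",
        first_page=1, last_page=9,
    )
    print(ET.tostring(build_article_set([example]), encoding="unicode"))
```

The sketch leaves out the File Header and every tag not named in its arguments; a real submission would need whatever the DTD marks as required for the article in question.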

Page 25: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

In an Ideal World…

Photo credit: http://www.priefert.com/

Page 26: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

The metadata that got away…

Pre-1945 issues not indexed by PubMed

Supplemental volumes*

What we hope to do about it:

Manually capture the Hathi URLs for the supplemental volumes and provide them to PubMed using their linking format

Manually capture citation data for pre-1945 articles using the OCR files, and send to PubMed using their indexing format.

Page 27: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Project Outcomes

Soft:

Better understanding of what’s possible with Hathi APIs

Better understanding of PubMed’s metadata/URL contribution requirements

Increased desire within the Cornell Library to consider greater return on our HathiTrust investment

Concrete:

The Cornell Veterinarian should soon be available via PubMed for the years already indexed

Manually capturing the complete backfile for The Cornell Veterinarian to contribute to PubMed

Page 28: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Future Considerations

Potential for improved access to other titles currently lacking full-text linking in PubMed [if in HathiTrust]

Investigations into other (non-)full-text indexes and full-text repositories

New Services for interacting with HathiTrust Digital Library

Potential improvements to the Hathi workflows.

Page 29: Wrangling metadata from hathi trust and pubmed to provide full text linking to the cornell veterinarian

Questions?

Photo credit: ehive.com