
From Data to Discovery

Building Automated Cataloguing Tools with Perl

Huw Jones

Cambridge University Library

Cambridge

Small city, big University = lots of libraries!

Lots of libraries = lots of books

Bibliographic records

University Library: 3.85 M

Other libraries: 2.5 M

8 databases

Data problems

Quality

Duplication

Quality – fullness

Of the 2.5 M records in our databases, 1 M are short records

Quality – coding

Duplication

Effects

• Difficulty in resource discovery

• Patchy retrieval

• Lack of authority control

• Difficulty with standard deduplication

• Burden on staff time

• Ties us to multiple database model

Aims

Better records

Fewer records

Existing Solutions?

• Manual recataloguing

• Commercial solutions

• Universal catalogue

• Discovery layer

These either don’t solve the core problem, or are expensive and/or time-consuming

Our solution

Automated Cataloguing Tools!

• Short record enrichment

• Automated MARC correction

• Deduplication

Order is important – full, well-coded records are easier to deduplicate

General principles

• Retrieve some records from a Voyager database

• Examine and/or manipulate them

• If necessary, make changes in the database

N.B. Watch indexes and table space!

General tools

• Perl – holds everything together

• Perl DBI – connects to databases

• SQL – retrieves records from database

• MARC::Record modules (from CPAN) – to examine/manipulate records

• Pbulkimport/Batchcat – to make changes to the database
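Put together, the retrieval step looks roughly like this. This is a minimal sketch: the DSN, schema and column names in the commented-out DBI section are illustrative of Voyager's segmented record storage, not verbatim.

```perl
use strict;
use warnings;

# Reassemble a MARC blob from Voyager-style segmented rows.
# Each record is stored as fixed-size chunks; the rows must be
# ordered by sequence number and concatenated.
sub assemble_record {
    my @rows = @_;                      # [ seqnum, segment ] pairs
    return join '',
        map  { $_->[1] }
        sort { $a->[0] <=> $b->[0] } @rows;
}

# Hypothetical DBI usage (DSN, table and column names are illustrative):
# my $dbh = DBI->connect('dbi:Oracle:VGER', $user, $pass);
# my $sth = $dbh->prepare(
#     'SELECT seqnum, record_segment FROM bib_data WHERE bib_id = ?');
# $sth->execute($bib_id);
# my $marc_blob = assemble_record(@{ $sth->fetchall_arrayref });
# my $record = MARC::Record->new_from_usmarc($marc_blob);
```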

Batchcat vs Pbulkimport

• Batchcat – installed on PC with Voyager

• More versatile

• Can’t be used on server

• Pbulkimport – limited functionality

• Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN)

• Can be used on server

Books

• Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: 0596001320

• Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN: 1565926994

Enriching short records

How to get from this …

to this

Basic mechanism

• Take short record

• Find a matching full record

• Overlay short record with full record

• Need a source of full records

• In Cambridge, the University Library has a large database of full, authority-controlled records
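The matching step can be sketched as a scoring function. The fields compared, the weights and the threshold below are illustrative assumptions, not the production algorithm.

```perl
use strict;
use warnings;

# Normalise a string for comparison: lower-case, strip punctuation,
# collapse whitespace.
sub normalise {
    my $s = lc(shift // '');
    $s =~ s/[^a-z0-9 ]//g;
    $s =~ s/\s+/ /g;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Score a candidate full record against a short record (0-100).
# Field weights here are hypothetical.
sub match_score {
    my ($short, $full) = @_;   # hashrefs with title/author/date keys
    my $score = 0;
    $score += 50 if normalise($short->{title})  eq normalise($full->{title});
    $score += 30 if normalise($short->{author}) eq normalise($full->{author});
    $score += 20 if ($short->{date} // '') eq ($full->{date} // '');
    return $score;
}

my $threshold = 80;   # overlay only when the match is at least this good
```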

1. File of SHORT RECORD bib ids

2. Connects to LOCAL database and checks if a valid bib id

3. Retrieves SHORT RECORD info from local database

4. Connects to EXTERNAL source. Finds best FULL RECORD match and scores it

5. Compares match score to overlay threshold. If OK, retrieves MARC record for FULL RECORD

6. Corrects FULL MARC record. Removes inappropriate fields. Inserts fields to be retained from SHORT RECORD

7. In local database, overlays SHORT RECORD with FULL RECORD
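The field-handling step – removing fields that should not be imported, then carrying over fields retained from the short record – might look like this minimal sketch. The tag choices (local 9xx, 852) are hypothetical examples, not the real retention rules.

```perl
use strict;
use warnings;

# Prepare the full record for overlay: strip fields that should not
# be imported, then append fields retained from the short record.
# Representing a record as an arrayref of [tag, data] pairs is a
# simplification for illustration.
sub prepare_overlay {
    my ($full_fields, $short_fields) = @_;
    my @out = grep { $_->[0] !~ /^9/ } @$full_fields;      # drop local 9xx
    push @out, grep { $_->[0] eq '852' } @$short_fields;   # keep local holdings
    return \@out;
}
```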

Output

Interface

Results

• Service has been running for 1 year (much of which was testing)

• 18 libraries subscribed to use service

• 90,000 short records upgraded

MARC checking and correction

• Bibliographic standard – agreed minimum standard for cataloguing

• Every week, libraries receive an automatically generated file of MARC coding errors for correction

• Based on MARC::Lint module with many alterations

Output

Mechanism

• Connects to database using Perl DBI

• Retrieves MARC record for records created/edited in last week

• Runs them through MARC check

• Prints errors to file

• Emails file to library
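A lint-style check function in this spirit is sketched below. The two rules are simplified examples of the kind of check run each week; the real MARC::Lint-based checks are far more extensive.

```perl
use strict;
use warnings;

# Examine one field and return a list of error messages.
# Rules shown here are illustrative, not the production rule set.
sub check_field {
    my ($tag, $data) = @_;
    my @errors;
    if ($tag eq '245' && $data !~ /[.?!]$/) {
        push @errors, '245: field should end with a full stop';
    }
    if ($tag eq '300' && $data =~ /\d+p\b/) {
        push @errors, '300: no space between pagination and p';
    }
    return @errors;
}
```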

Over 100,000 errors pointed out so far!

MARC Correction

How to get from this …

=LDR 00472nam\\2200157\a\4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W.S.,$d1931-
=245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
=260 \\$aNew York ;$bEldigio Press,$cc1985
=300 \\$a291p $bill $c23cm
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.

to this!

=LDR 00453nam 2200157 a 4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W. S.,$d1931-
=245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
=260 \\$aNew York :$bEldigio Press,$cc1985.
=300 \\$a291 p. :$bill. ;$c23 cm.
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.

MARC Correction

• A version of the checking module which corrects errors automatically where there is no ambiguity

• Built into short record upgrade program

• Also offered as a retrospective service to clean up legacy records

• Possibility of building it into weekly check

Mechanism

• Connects to database using Perl DBI

• Retrieves full MARC record

• Runs against correction module

• Replaces corrected record in database

Output

Bib id: 662002
How to build a habitable planet ; By Wallace S. Broecker.
100: UPDATE: Spaces inserted between initials in subfield _a
245: UPDATE: By uncapitalised at start of subfield c
245: UPDATE: Space forward slash inserted before subfield _c
260: UPDATE: Full stop inserted at end of field
260: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Full stop inserted after the p in pagination
300: UPDATE: Full stop inserted at end of field
300: UPDATE: Illustration abbreviation has been corrected
300: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Space inserted between digits and cm
300: UPDATE: Space inserted between digits and p in pagination
300: UPDATE: Space semi-colon inserted before subfield c
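Unambiguous corrections of this kind can be expressed as regex substitutions. The sketch below covers a sample of 300-field rules only, operating on a field in mnemonic format ('$' as the subfield delimiter); the real rule set is much larger.

```perl
use strict;
use warnings;

# Apply a sample of unambiguous 300-field corrections.
sub correct_300 {
    my ($f) = @_;
    $f =~ s/(\d)p\b/$1 p/g;              # space between digits and p
    $f =~ s/\bp(?!\.)\b/p./g;            # full stop after the p in pagination
    $f =~ s/\$bill\b(?!\.)/\$bill./;     # correct illustration abbreviation
    $f =~ s/(\d)cm\b/$1 cm/g;            # space between digits and cm
    $f =~ s/([^\s:]) ?\$b/$1 :\$b/;      # space colon before subfield b
    $f =~ s/([^\s;]) ?\$c/$1 ;\$c/;      # space semi-colon before subfield c
    $f =~ s/([^.])$/$1./;                # full stop at end of field
    return $f;
}
```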

Results

• In testing 70,000 records processed

• Corrected over 200,000 MARC coding errors

• May run ALL our existing records through at some stage

Deduplication – in progress!

Three stages:

• Identification of groups of duplicates

• Identification/construction of ‘best’ record

• Deletion of other records – relinking of holdings/items/Purchase Orders to ‘best record’

Identification of duplicates

• Connect to a database with Perl DBI

• Use SQL to retrieve records

• For each record, retrieve all available data from tables

• Use matching algorithm to identify groups of duplicates

And you’ll end up with groups of candidate duplicate records
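A minimal sketch of the grouping step, assuming a normalised title-plus-date match key – a deliberate simplification of the real matching algorithm:

```perl
use strict;
use warnings;

# Build a crude match key from title and date (illustrative only).
sub match_key {
    my ($rec) = @_;
    my $t = lc($rec->{title} // '');
    $t =~ s/[^a-z0-9]//g;
    return $t . '|' . ($rec->{date} // '');
}

# Bucket records by key; keep only buckets with more than one member.
sub find_duplicate_groups {
    my (@records) = @_;
    my %groups;
    push @{ $groups{ match_key($_) } }, $_->{bib_id} for @records;
    return [ grep { @$_ > 1 } values %groups ];
}
```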

Identification of best record

• For each group of duplicates, the MARC records are retrieved

• Passed to scoring algorithm

• Record with highest score forms basis of ‘best’ record

• Retains selected fields (e.g. subject headings) from the ‘other’ records

• Corrects any MARC coding errors
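The scoring step might be sketched like this; the weighting (one point per field, a bonus for heading fields) is a hypothetical illustration of "fullest record wins", not the production weights:

```perl
use strict;
use warnings;

# Score a record's fullness; records are arrayrefs of [tag, data] pairs.
sub record_score {
    my ($fields) = @_;
    my $score = @$fields;                                      # 1 point per field
    $score += 2 for grep { $_->[0] =~ /^(1|6|7)/ } @$fields;   # bonus for headings
    return $score;
}

# Pick the highest-scoring record in a group of duplicates.
sub best_record {
    my (@group) = @_;   # list of { bib_id, fields } hashrefs
    my ($best) = sort {
        record_score($b->{fields}) <=> record_score($a->{fields})
    } @group;
    return $best->{bib_id};
}
```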

But …

• No relinking functionality, even in BatchCat

• No viable workaround for libraries using Acquisitions, or for relinking without losing circulation history

In conclusion …

• Tools for librarians, not replacements!

• Do the stuff programs do well, allowing humans to concentrate on what humans do well

• Won’t do all the work, just makes a solution to major data problems feasible

Questions?