Data Wrangling: MSCS View from the Trenches

Upload: mainesharedcollections

Post on 12-Jan-2015

DESCRIPTION

Presentation slides from MSCS Systems Librarian Sara Amato's ALA ALCTS Pre-Conference presentation in Chicago, IL on June 27, 2013.

TRANSCRIPT

Page 1: Data Wrangling: MSCS View from the trenches

Data Wrangling: MSCS View from the trenches

What we've learned

Where we failed

How we succeeded

Page 2: Data Wrangling: MSCS View from the trenches

You do what?

Liaising between tech services, project team, and vendors on data manipulation and display

Skills:
MARC and ILS data migration/manipulation
Nitty Gritty details – hows and whys
Knowledge sharing between partners
Investigations and Implementations
Project management
Meeting management

Page 3: Data Wrangling: MSCS View from the trenches
Page 4: Data Wrangling: MSCS View from the trenches
Page 5: Data Wrangling: MSCS View from the trenches
Page 6: Data Wrangling: MSCS View from the trenches

Data driven? Start at the end!

What do you really want to know?
Do you have the data to answer that?
What are you going to do with the data?
What is interesting vs. what is actionable
Test out your theories!!

Page 7: Data Wrangling: MSCS View from the trenches

We Needed Data

Page 8: Data Wrangling: MSCS View from the trenches

Data driven? Start at the end!

Comparisons across institutions – match points

Started with an OCLC reclamation project

            Records Sent   Returned Unresolved   Updated OCLC #
Ursus          2,100,299                13,232          171,474
Colby            474,438                   373           26,334
Bowdoin          624,164                                 37,848
Bates            656,926                                 25,101
TOTALS         3,855,827                13,605          260,757
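
As a quick sanity check on the reclamation counts above, the column totals and the share of records that received an updated OCLC number can be recomputed. This is only a sketch: the dictionary layout and function names are illustrative, and the cells left blank on the slide (Returned Unresolved for Bowdoin and Bates) are assumed to be zero.

```python
# Counts transcribed from the slide, as
# (records sent, returned unresolved, updated OCLC #).
# Cells that are blank on the slide are entered as 0 here (an assumption).
counts = {
    "Ursus": (2_100_299, 13_232, 171_474),
    "Colby": (474_438, 373, 26_334),
    "Bowdoin": (624_164, 0, 37_848),
    "Bates": (656_926, 0, 25_101),
}


def column_totals(rows):
    """Sum each of the three columns across all libraries."""
    return tuple(sum(column) for column in zip(*rows.values()))


def share_updated(sent, updated):
    """Percentage of sent records whose OCLC number was updated."""
    return round(100 * updated / sent, 1)


totals = column_totals(counts)
print("totals:", totals)
for name, (sent, _unresolved, updated) in counts.items():
    print(name, share_updated(sent, updated), "% updated")
```

With blanks read as zero, the recomputed totals agree with the TOTALS row on the slide, which is a useful check that the numbers were transcribed into the right columns.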

Page 9: Data Wrangling: MSCS View from the trenches

Start at the end...if you're ordering out

Think about what you want to get back, make sure it goes out.

HOW will you deal with returned data?

Can all the partners do the same things in terms of processing?

Page 10: Data Wrangling: MSCS View from the trenches

Lists, lists, lists!

What will you in/exclude if you are extracting:

types: gov docs, serials, media, e-resources

locations: ref, off-site, reserve, special collections

status: billed, missing, suppressed, withdrawn (!)

use: circ, internal use, reserves

What constitutes a circulating copy?

How are the above encoded? Can you get what you want?
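
One way to make the in/exclude decisions explicit is to write them down as data and record why each item was dropped. The sketch below assumes a very simplified item record; the field names and the code values in the three sets are hypothetical placeholders, since the real encodings differ per ILS and per partner and which categories to exclude is a local decision.

```python
from dataclasses import dataclass

# Hypothetical code values; check how each partner actually encodes these.
EXCLUDED_TYPES = {"govdoc", "serial", "media", "eresource"}
EXCLUDED_LOCATIONS = {"ref", "offsite", "reserve", "special"}
EXCLUDED_STATUSES = {"billed", "missing", "suppressed", "withdrawn"}


@dataclass
class Item:
    barcode: str
    material_type: str
    location: str
    status: str


def exclusion_reasons(item):
    """Return every criterion that excludes this item (empty list = include)."""
    reasons = []
    if item.material_type in EXCLUDED_TYPES:
        reasons.append("type:" + item.material_type)
    if item.location in EXCLUDED_LOCATIONS:
        reasons.append("location:" + item.location)
    if item.status in EXCLUDED_STATUSES:
        reasons.append("status:" + item.status)
    return reasons


def split_extract(items):
    """Split items into (included, excluded) where excluded keeps its reasons."""
    included, excluded = [], []
    for item in items:
        reasons = exclusion_reasons(item)
        if reasons:
            excluded.append((item, reasons))
        else:
            included.append(item)
    return included, excluded
```

Keeping the reasons alongside each excluded item makes it easier to compare extracts across partners and to document the list creation criteria later.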

Page 11: Data Wrangling: MSCS View from the trenches

Circ Data

How long has it been retained?
Any tech processing that included circing?
Has it ever been cleared?
(… and what does it really tell you ...)

Page 12: Data Wrangling: MSCS View from the trenches

Know your vendor / programmer

What exactly is going to happen to the data, and what will be in(ex)cluded?
Leader bib level m, s
Gov Doc? (008 / 28)?
Printed material?
Media?

Page 13: Data Wrangling: MSCS View from the trenches

So, you think you know your data...

Page 14: Data Wrangling: MSCS View from the trenches

Can you get it out?

Export Tables

What exactly is exported?
What do they do with weird data? (b b, b 930)
Do they add any data? v.v.29 , oclc prefix
Formats of dates

Page 15: Data Wrangling: MSCS View from the trenches

Your data may vary

35109002285482 3510900228549
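
The two barcodes above differ only in length (14 digits versus 13), which is easy to miss by eye. A minimal sketch of a length and character check follows; the expected length of 14 is an assumption drawn from the first example, and the function names are illustrative.

```python
from collections import Counter

EXPECTED_LENGTH = 14  # assumption: the 14-digit form is the intended one


def barcode_problem(barcode, expected_length=EXPECTED_LENGTH):
    """Describe what is wrong with a barcode, or return None if it looks fine."""
    value = barcode.strip()
    if not value:
        return "empty"
    if not value.isdigit():
        return "non-digit characters"
    if len(value) != expected_length:
        return f"length {len(value)}, expected {expected_length}"
    return None


def length_profile(barcodes):
    """Count barcodes by length to spot truncated or padded values."""
    return Counter(len(code.strip()) for code in barcodes)


for code in ["35109002285482", "3510900228549"]:
    print(code, barcode_problem(code))
```

Running a length profile over a full export before matching on barcodes gives a quick picture of how much the data varies.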

Page 16: Data Wrangling: MSCS View from the trenches

Document!!! REALLY!!!

Export tables and field mappings
Locations
List creation criteria
Record ranges exported and dates
Files

Page 17: Data Wrangling: MSCS View from the trenches

… a few of the ugly things we saw...

Multiple fields used for internal use (INTL USE, COPY USE, and IUSE3)
Records with multiple 001s
Records with multiple barcodes, duplicate barcodes, bound with items
Barcodes in 949 not 'b'
Records with no 260
3 0000003 ocm3 3_
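
Several of the problems listed above can be surfaced with simple counting before any data is sent out. The sketch below is not a MARC parser: it assumes each record has already been flattened into a list of (tag, value) pairs, and the "barcode" tag name is a hypothetical stand-in for wherever a given export actually puts the barcode.

```python
from collections import defaultdict

# Each record is a list of (tag, value) pairs. This flat layout is a
# simplification for illustration only.


def records_with_multiple_001(records):
    """Return indexes of records that carry more than one 001 field."""
    return [
        index
        for index, record in enumerate(records)
        if sum(1 for tag, _ in record if tag == "001") > 1
    ]


def records_missing_tag(records, tag="260"):
    """Return indexes of records that have no field with the given tag."""
    return [
        index
        for index, record in enumerate(records)
        if all(field_tag != tag for field_tag, _ in record)
    ]


def duplicate_barcodes(records, barcode_tag="barcode"):
    """Map each barcode found on more than one record to those record indexes."""
    seen = defaultdict(set)
    for index, record in enumerate(records):
        for tag, value in record:
            if tag == barcode_tag:
                seen[value].add(index)
    return {code: sorted(found) for code, found in seen.items() if len(found) > 1}
```

A barcode shared by several records is not always an error, since bound-with items legitimately share one, so the output is a list to review, not a list to fix automatically.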

Page 18: Data Wrangling: MSCS View from the trenches

Your data through different lenses

Points of departure:

-Merged 001s
-FRBR
-Volume vs Title counts
-Unique vs Holdings counts
-Date of data used
-Definition of public domain

Page 19: Data Wrangling: MSCS View from the trenches

When things go wrong

MarcEdit is your friend!

Page 20: Data Wrangling: MSCS View from the trenches

One more reason to thank Terry Reese

-- Return the 001 of every record that has a 945 $f greater than zero
SELECT T0xx.field_data
FROM T0xx, T9xx
WHERE T9xx.field = '945'
  AND T9xx.subfield = 'f'
  AND T9xx.field_data > 0
  AND T0xx.cid = T9xx.cid
  AND T0xx.field = '001'
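
To try a query of this shape without touching real data, it can be run against a small in-memory SQLite database. The table and column names below come from the query on the slide, but the column types and the rest of the schema are assumptions made for this sketch. It also wraps the 945 $f value in CAST so the comparison is numeric even when the value is stored as text, which is a deliberate departure from the slide's query.

```python
import sqlite3

# Same shape as the slide's query, with a CAST added (see lead-in).
QUERY = """
SELECT T0xx.field_data
FROM T0xx, T9xx
WHERE T9xx.field = '945'
  AND T9xx.subfield = 'f'
  AND CAST(T9xx.field_data AS INTEGER) > 0
  AND T0xx.cid = T9xx.cid
  AND T0xx.field = '001'
"""


def control_numbers_with_positive_945f(rows_0xx, rows_9xx):
    """Load sample rows into an assumed schema and run the query."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only the columns the query refers to.
    conn.execute("CREATE TABLE T0xx (cid INTEGER, field TEXT, field_data TEXT)")
    conn.execute(
        "CREATE TABLE T9xx (cid INTEGER, field TEXT, subfield TEXT, field_data TEXT)"
    )
    conn.executemany("INSERT INTO T0xx VALUES (?, ?, ?)", rows_0xx)
    conn.executemany("INSERT INTO T9xx VALUES (?, ?, ?, ?)", rows_9xx)
    result = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return result


sample_0xx = [(1, "001", "ocm1"), (2, "001", "ocm2")]
sample_9xx = [(1, "945", "f", "3"), (2, "945", "f", "0")]
print(control_numbers_with_positive_945f(sample_0xx, sample_9xx))
```

Only the record whose 945 $f is above zero comes back, so the sample data doubles as a small test of the join condition on cid.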

Page 21: Data Wrangling: MSCS View from the trenches

Data Wrangling: MSCS Side

Closing Haiku:

Data is messy
While it can be normalized
Nothing is perfect