#mashcat: evolving marcedit: leveraging semantic data in marcedit
TRANSCRIPT
#mashcat: Evolving MarcEdit: Leveraging Semantic Data in MarcEdit
Little History
MarcEdit development started around 1999ish (as parts)
◦ Originally coded in 3 programming languages: Assembler (libraries), Visual Basic (UI) and Delphi (COM).
◦ I started writing it as an undergraduate to better understand MARC & circumvent OCLC’s Passport for Windows program
◦ First “MarcEdit” was released Sept. 11, 2000 (thank you WayBack Machine: http://web.archive.org/web/20001017105529/http://ucs.orst.edu/~reeset/marcedit/indexb.html)
Today:
◦ Written in C# (Windows/Linux) & Objective-C/C# (OSX)
◦ Active user community is ~20,000ish (based on update logs)
◦ Used in ~190ish countries/political regions
◦ Roughly 1/3 of the users reside outside of Canada/United States*
* Based on loose analysis of server logs by my server-side stats software
MarcEdit Evolution
[Screenshots: MarcEdit 1.0-2.0 Main Window; MarcEdit MARC Tools 1.0-2.0; MarcEdit 1.0-2.0 MarcEditor]
MarcEdit Evolution
Early application was developed to (again, thank you Internet Archive):
1. Be user-friendly (whether I’ve accomplished that is debatable – I’m not a UI designer)
2. Support LC’s MARCBreakr/Maker diacritics (largely yes)
3. Be fast (which I think it is)
4. Simplify editing records in batch
5. Provide a set of programming tools to solve my own local needs
MarcEdit Today
Three development rules I follow
MarcEdit is a real-world metadata tool
◦ Tool is designed to provide workflows for data problems facing libraries right now
MarcEdit is MARC Agnostic
◦ Too many metadata tools are anglo-centric; MarcEdit has been designed to work within the very heterogeneous metadata environment that we find ourselves in today, which includes:
  ◦ Support for MARC (not a particular flavor*)
  ◦ Near universal character set support (because the world is bigger than MARC8 and UTF8)
  ◦ Support for a wide range of library metadata standards beyond MARC
MarcEdit is one part of the larger library metadata tooling environment
◦ So integrations with OCLC, ILSs (when possible), OpenRefine are important
* And if something assumes MARC21 – call it out
So how does any of this relate to semantic data in Libraries?
http://musictheorysite.com/img/dwight_question.jpg
A lot of metadata people I talk to fall into two camps
BibFrame and Linked Data as RDA 2.0
BibFrame
http://www.wired.com/wp-content/uploads/archive/news/images/full/duke_nukem_frever_f.16807.jpg
http://astronomy.nmsu.edu/cwc/Group/magiicat/images/magiicat-logo.gif
Linked Data
BibFrame and linked data as datacorns
https://whatsthebigdata.files.wordpress.com/2015/10/datascience_unicorn.png?w=640
I prefer a more practical outlook…
https://www.etsy.com/search?q=unicorn+cat+hat
MarcEdit’s MARCNext
MarcEdit’s MARCNext is a first attempt to start having this discussion by:
1. Integrating a linked data framework into MarcEdit, including tooling for:
a. JSON-LD
b. SPARQL
c. RDF
2. Providing catalogers with proof of concept tools to begin experimenting with their own data
3. Providing a method to integrate semantic concepts into legacy data
4. Providing a toolset that MarcEdit can use to build new tools
Let’s take a closer look at two tools
Link Identifiers Tool
◦ This tool embeds URIs into MARC data
◦ Is rules driven (i.e., not MARC21 centric)
◦ Supports ~24 different in-use data sources
Validate Headings Tool
◦ First tool in MarcEdit to make use of the tool’s linked data platform and available data services to provide a real-world application.
Link Identifiers Tool
Link Identifiers Tool
Initially released in Aug. 2014[1] as a proof of concept for testing the linked data framework being developed in MarcEdit
◦ Initially only processed LCSH and NAF
Currently, I’ve profiled ~24 data sources, and the tool can be integrated in MarcEdit’s Task Workflow.
◦ Translation profiles are currently in flux, as I work with a PCC group developing recommendations for embedding URIs in MARC records.
◦ Working on a process that would allow users to self-profile identifier services, so long as they support JSON-LD or SPARQL.
[1] MarcEdit’s Research Toolkit: MARCNext: http://blog.reeset.net/archives/1359
Link Identifiers Tool
Tool has evolved over the last year to utilize a rules-based configuration (example):
<field type="bibliographic">
  <tag>630</tag>
  <ind2 value="0" vocab="naf_lcsh" />
  <ind2 value="1" vocab="lcshac" />
  <ind2 value="2" vocab="mesh" />
  <subfields>adfkqnp</subfields>
  <uri>0</uri>
  <special_instructions>mixed</special_instructions>
</field>
<field type="authority|bibliographic">
  <tag>336</tag>
  <subfields>a</subfields>
  <index>2</index>
  <uri>0</uri>
</field>
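To make the rules-driven idea concrete, here is a minimal Python sketch of how a configuration like the one above could be read and queried. It is not MarcEdit's own loader: the `<rules>` wrapper element and the function names `load_rules` and `vocab_for` are assumptions added for illustration, and the dictionary keys simply mirror the element names from the slide without claiming what MarcEdit does with each one.

```python
import xml.etree.ElementTree as ET

# The <rules> root is a hypothetical wrapper; the slide shows only <field> entries.
RULES_XML = """
<rules>
  <field type="bibliographic">
    <tag>630</tag>
    <ind2 value="0" vocab="naf_lcsh" />
    <ind2 value="1" vocab="lcshac" />
    <ind2 value="2" vocab="mesh" />
    <subfields>adfkqnp</subfields>
    <uri>0</uri>
    <special_instructions>mixed</special_instructions>
  </field>
  <field type="authority|bibliographic">
    <tag>336</tag>
    <subfields>a</subfields>
    <index>2</index>
    <uri>0</uri>
  </field>
</rules>
"""


def load_rules(xml_text):
    """Return {tag: rule dict}, one entry per <field> element."""
    rules = {}
    for field in ET.fromstring(xml_text).iter("field"):
        rules[field.findtext("tag")] = {
            # "authority|bibliographic" becomes ["authority", "bibliographic"].
            "record_types": field.get("type", "").split("|"),
            # Second-indicator value mapped to the vocabulary name it selects.
            "ind2_vocabs": {i.get("value"): i.get("vocab") for i in field.findall("ind2")},
            "subfields": field.findtext("subfields", default=""),
            "index": field.findtext("index"),
            "uri": field.findtext("uri"),
            "special_instructions": field.findtext("special_instructions"),
        }
    return rules


def vocab_for(rules, tag, ind2):
    """Look up the vocabulary a tag/second-indicator pair points at, if any."""
    rule = rules.get(tag)
    return None if rule is None else rule["ind2_vocabs"].get(ind2)


rules = load_rules(RULES_XML)
print(vocab_for(rules, "630", "2"))
print(rules["336"]["record_types"])
```

Keeping the field-to-vocabulary mapping in data rather than in code is what lets the same tool serve MARC flavors other than MARC21: a different profile only needs a different rules file.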
Linked Identifiers: Turning strings
=336 \\$atext$btxt$2rdacontent
=337 \\$aunmediated$bn$2rdamedia
=338 \\$avolume$bnc$2rdacarrier
=600 10$6880-06$aHu, Zongnan,$d1896-1962$vDiaries.
=650 \0$aGenerals$zChina$vBiography.
=650 \0$aGenerals$zTaiwan$vBiography.
=600 17$aHu, Zongnan,$d1896-1962.$2fast$0(OCoLC)fst00131171
=650 \7$aGenerals.$2fast$0(OCoLC)fst00939841
=651 \7$aChina.$2fast$0(OCoLC)fst01206073
=651 \7$aTaiwan.$2fast$0(OCoLC)fst01207854
=655 \7$aDiaries.$2lcgft
=655 \7$aAutobiographies.$2lcgft
Linked Identifiers: into strings+
=336 \\$atext$btxt$2rdacontent$0http://id.loc.gov/vocabulary/contentTypes/txt
=337 \\$aunmediated$bn$2rdamedia$0http://id.loc.gov/vocabulary/mediaTypes/n
=338 \\$avolume$bnc$2rdacarrier$0http://id.loc.gov/vocabulary/carriers/nc
=600 10$6880-06$aHu, Zongnan,$d1896-1962$vDiaries.$0http://id.loc.gov/authorities/names/n84029846
=650 \0$aGenerals$zChina$vBiography.$0http://id.loc.gov/authorities/subjects/sh2008105087
=650 \0$aGenerals$zTaiwan$vBiography.$0http://id.loc.gov/authorities/subjects/sh2008105117
=600 17$aHu, Zongnan,$d1896-1962.$2fast$0http://id.worldcat.org/fast/00131171
=650 \7$aGenerals.$2fast$0http://id.worldcat.org/fast/00939841
=651 \7$aChina.$2fast$0http://id.worldcat.org/fast/01206073
=651 \7$aTaiwan.$2fast$0http://id.worldcat.org/fast/01207854
=655 \7$aDiaries.$2lcgft$0http://id.loc.gov/authorities/genreForms/gf2014026085
=655 \7$aAutobiographies.$2lcgft$0http://id.loc.gov/authorities/genreForms/gf2014026047
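Two of the transformations in the before/after records need no lookup service at all, so they can be sketched in a few lines of Python. This is only an illustration of the pattern visible in the records above, not the tool's actual logic: the function names are hypothetical, and the base URIs are copied from the "after" records. Name, subject, and genre headings (the 600/650/655 lines) require a query against an identifier service and are deliberately left untouched here.

```python
import re

# Base URIs taken from the 336/337/338 "after" lines above.
RDA_BASES = {
    "rdacontent": "http://id.loc.gov/vocabulary/contentTypes/",
    "rdamedia": "http://id.loc.gov/vocabulary/mediaTypes/",
    "rdacarrier": "http://id.loc.gov/vocabulary/carriers/",
}

# Matches a FAST control number such as $0(OCoLC)fst00131171.
FAST_CONTROL = re.compile(r"\$0\(OCoLC\)fst(\d+)")


def parse_subfields(line):
    """Split a mnemonic line into (code, value) pairs, ignoring tag and indicators."""
    _, _, data = line.partition("$")
    return [(chunk[0], chunk[1:]) for chunk in data.split("$") if chunk]


def fast_control_to_uri(line):
    """Rewrite $0(OCoLC)fstNNN as $0http://id.worldcat.org/fast/NNN."""
    return FAST_CONTROL.sub(
        lambda m: "$0http://id.worldcat.org/fast/" + m.group(1), line
    )


def add_rda_uri(line):
    """Append a $0 URI built from $b when $2 names an RDA vocabulary."""
    subfields = dict(parse_subfields(line))
    base = RDA_BASES.get(subfields.get("2"))
    if base is None or "b" not in subfields or "0" in subfields:
        return line  # not an RDA term line, or already linked
    return f"{line}$0{base}{subfields['b']}"


def link_line(line):
    """Apply both lookup-free transformations to one mnemonic line."""
    return add_rda_uri(fast_control_to_uri(line))


print(link_line(r"=336 \\$atext$btxt$2rdacontent"))
print(link_line(r"=651 \7$aChina.$2fast$0(OCoLC)fst01206073"))
```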
Example
Linked Data tools
Things that are still hard:
◦ Most identifier services use their own rules for data escaping – and they aren’t documented
◦ Many services are still not well suited for this work
◦ Anything that doesn’t provide an option to do an exact lookup, like ULAN, AAT, or VIAF – all these require additional processing to ensure that results match the queried term.
◦ Many services are little “p” production in that lots of look-ups can (and do) cause problems.
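The "additional processing" needed when a service only offers keyword search can be sketched as a post-filter: normalize the queried term and each returned label the same way, then keep only the results that compare equal. The sketch below is an assumption about one reasonable approach, not the tool's documented behavior. The candidate list shape (dicts with `label` and `uri` keys), the function names, and the normalization rules are all hypothetical, and the sample URIs are placeholders.

```python
import unicodedata


def normalize(label):
    """Fold case, Unicode form, whitespace and trailing punctuation for comparison."""
    text = unicodedata.normalize("NFKC", label).casefold()
    text = " ".join(text.split())  # collapse runs of whitespace
    return text.rstrip(" .,;:")


def exact_matches(term, candidates):
    """Keep only candidates whose label equals the queried term after normalizing."""
    wanted = normalize(term)
    return [c for c in candidates if normalize(c["label"]) == wanted]


# Placeholder results, standing in for what a keyword search might return.
results = [
    {"label": "Generals.", "uri": "urn:example:1"},
    {"label": "Generals--China", "uri": "urn:example:2"},
    {"label": "Generals in art", "uri": "urn:example:3"},
]
print(exact_matches("generals", results))
```

Returning a list rather than a single hit is deliberate: zero matches means the term should be reported, and more than one means the match is ambiguous and should not be embedded automatically.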
Validate Headings
Automated authority control processing
◦ Utilizes id.loc.gov
◦ Provides reports of data that isn’t currently “authorized”
◦ Provides options for generating brief authorities
◦ Extracts for further data processing
◦ Ability to embed URIs during validation
◦ If URIs are present – they are used rather than a direct look up
◦ Automatic heading correction when variants are encountered
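The "use the URI if one is present, otherwise look the heading up" rule can be sketched as a small branching function. This is an illustrative outline only: `validate_heading`, `lookup_label`, and `uri_is_known` are hypothetical names, the two callables stand in for whatever service client is used (neither is a MarcEdit or id.loc.gov API), and joining subdivisions with `--` assumes a subject heading rather than a name.

```python
def validate_heading(subfields, lookup_label, uri_is_known):
    """Classify one heading; returns (status, uri_or_None).

    subfields: list of (code, value) pairs for the heading field.
    lookup_label: hypothetical callable, label -> URI or None.
    uri_is_known: hypothetical callable, URI -> bool.
    """
    uris = [v for code, v in subfields if code == "0" and v.startswith("http")]
    if uris:
        # An embedded URI is checked directly; no label search is made.
        status = "authorized" if uri_is_known(uris[0]) else "unauthorized"
        return status, uris[0]
    # Assumption: alphabetic subfields joined with "--" form the subject string.
    label = "--".join(v.rstrip(".,") for code, v in subfields if code.isalpha())
    uri = lookup_label(label)
    return ("authorized" if uri else "unauthorized"), uri


# Stand-in data built from a heading/URI pair shown earlier in the slides.
KNOWN = {
    "Generals--China--Biography": "http://id.loc.gov/authorities/subjects/sh2008105087",
}
heading = [("a", "Generals"), ("z", "China"), ("v", "Biography.")]
print(validate_heading(heading, KNOWN.get, set(KNOWN.values()).__contains__))
```

Headings that come back "unauthorized" are the ones that would feed the report described above.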
Validate Headings
Validate Headings can be run from inside the MarcEditor, or outside as a standalone tool
Example
Continued work…
Would like to continue adding vocabularies
Expand headings validation to more than just LCSH/NAF
Include Linking Profiles for UNIMARC
Using Linked Data sources for sameas subject generation
Questions
Contact Information:
Terry Reese
Email: [email protected] or [email protected]
Website: http://marcedit.reeset.net
Help: http://marcedit.reeset.net/help