Running Head: MUSEUM COLLECTIONS METADATA AND LINKED DATA
A Method for Aligning Museum Collections Metadata with Standards, Using Linked Data
Case Study: The [name withheld] Museum Collection Subject Vocabulary
Susan Edwards
May 13, 2013
SJSU – School of Library and Information Science
LIBR 281 - Metadata
Museums have been working together for several decades to find a way to share and
aggregate their collections data (Waibel, 2010). A key barrier to success has been the unique,
idiosyncratic nature of most museum collections metadata schemata. Although metadata
standards for describing cultural heritage objects and works of art do exist, museums have not
accepted a single standard that would facilitate universal interoperability (Coburn et al., 2010;
Koolen, Kamps & De Keijzer, 2009; Waibel, 2010). In addition, museums tend to develop local
vocabularies tailored to the unique nature of a specific collection. Thus, sharing data and
searching across collections is a challenge not only because the structure of the data differs from
institution to institution, but because the very language used to describe their collections also
differs. Linked open data (LOD) may offer a solution to the problem of museum metadata
aggregation and sharing, precisely because LOD can facilitate interoperability without requiring
standard schemata and vocabularies.
The case study described in this paper applies and evaluates one technique for using LOD
to increase the interoperability of museum collections data. Developed by the Free Your
Metadata (FYM) project (http://freeyourmetadata.org/about/) (van Hooland et al., 2013), the
method involves mapping a local vocabulary to standard controlled vocabularies using linked
data. The authors aimed to demonstrate how reconciling local vocabularies to the linked data
cloud can be used to "derive additional value" from existing metadata (van Hooland et al.,
2013, p. 464). This case study applies the FYM method to the X Museum collection (name
withheld; X is a pseudonym). The museum is in the process of updating the presentation of its
collections online and has found that the local vocabulary needs to be updated. The museum is
interested in mapping the local vocabulary to a standard, which would provide more flexibility
for future cataloging and data interoperability, and would also look towards a future when the data is
available as LOD, or integrated as part of a cross-museum-collections search. By mapping to
standards that are available via LOD, and by adding the relevant LOD metadata to their own
metadata in the process, the museum could take one giant leap towards future interoperability
with other museum collections.
This paper will provide a review of the literature about issues with museum metadata
interoperability, and about the emerging practice of museums using LOD. It will then provide a
report on the application of the FYM methodology to the X Museum collection data, with an
analysis of the results, and recommendations for future work.
Literature Review
The problem of heterogeneous metadata. The sharing of collections data is not new to
museums. Waibel (2010) points out that the desire for museum metadata interoperability
dates back to 1969 and the establishment of the Museum Computer Network (MCN). A wiki
created and maintained by museum technology professionals and dedicated to discussion and
information sharing about museum collection data, "Museums and the machine-processable
web" (Museum API Wiki) http://museum-api.pbworks.com/, includes a long list of institutions,
including many museums, that make collections data available upon request (Museum APIs,
2013). But the form in which the data is provided, the methods of delivery, and the schemas
utilized, vary widely. Most museums have created their own unique APIs to provide access, or
use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to provide raw
data. The plethora of proprietary APIs for accessing museum collections data points to a
longstanding issue in this sector: museums simply don't have standard, shared metadata schemas,
vocabularies, or description frameworks, for their collections.
Chan (2012) and Elings and Waibel (2007) contrast the nature of museum collections
with those of archives and libraries in order to explain this situation. Libraries and archives,
which have been at the forefront of metadata standards development, have standard methods of
describing their collections via schemas, cataloging rules, and vocabularies, such as AACR,
LCSH, MARC, Dublin Core, and DACS, which every institution uses. But because museums
have very diverse collections of unique objects, they have not been set up in the same way as
libraries, and have often described their collections in an idiosyncratic, inconsistent manner.
Many have developed their own internal description standards and vocabularies, and have even
applied those standards inconsistently over time (Waibel, 2010). Barker (2012) argues that the
fundamental philosophy of museums is different from libraries: libraries have focused on access
to resources and do little interpretation, while museums tend to focus on collecting and providing
interpretation about their objects. White (2012) echoes this sentiment, pointing out that the
varied nature of museum data is a manifestation of every museum's aspiration to be seen as
distinct through its own collection. Thus, museum metadata is so heterogeneous and
idiosyncratic from institution to institution that a major interoperability problem is presented
when institutions begin talking about sharing data (Chan, 2012; Henry and Brown, 2012; Isaac,
Clayphan, & Haslhofer, 2012; Koolen, Kamps, & De Keijzer, 2009; Waibel, 2010).
In the past eight years, much effort has been devoted to the various metadata schemata
for cultural heritage and visual culture objects, with the aim of providing a single standard
vehicle for data aggregation and sharing. In 2005, Scruton pointed to the findings of the FAIR
(Focus on Access to Institutional Resources) program in the UK, a project to experiment with the
viability of metadata harvesting, which found that Dublin Core was an inadequate metadata
schema for museum collections. In 2007 an issue of the Visual Resources Association Bulletin
was dedicated to exploring the sticky issues around metadata schemas for visual culture,
rehashing the problem of heterogeneous data, analyzing the appropriateness of the various
schemata for data exchange, and arguing the benefits of standards for interoperability. Several
authors analyzed the applicability of various metadata schemata for museums—from Categories
for the Description of Works of Art (CDWA), to Cataloguing Cultural Objects (CCO), to the
Visual Resources Association Core metadata (VRA Core) (Baca, 2007; Elings and Waibel, 2007;
Kessler, 2007). Kessler (2007) acknowledged that, "a truly viable mechanism for shared
cataloging continues to elude the visual resources community" (p. 20).
Deciding on a standard metadata schema, difficult as that is, does not seem to be the only
issue. The real problem is the data itself, as Chan (2012) states bluntly, "… the truth is that our
[the museum sector's] data sucks." For example, Ridge (2012), in her examination of the data
from a single collection, discovered that the application of inconsistent metadata standards over
time was the main barrier to working with the data. Waibel (2010) discusses the Museum Data
Exchange project, an OCLC-funded project started in 2007 to lower the barrier to metadata
exchange in museums. The project quickly ran up against issues with the quality of the data itself
—inconsistently applied terms, a lack of standard vocabularies, and missing information. Isaac,
Clayphan, and Haslhofer (2012) also describe similar issues found with museum data through the
European Union's Europeana project (http://www.europeana.eu/), a digital library of European
cultural heritage that makes digitized collections from more than 2,000 institutions across Europe
—primarily libraries and archives—available through one portal (Purday, 2009). With such
inconsistency in the underlying data, it would seem that even if a set of standard metadata
schemata could be agreed to in museums, it would take a large effort for these institutions to
reconcile their data sets to that standard.
Linked data to the rescue. Van Hooland et al. (2013), authors of the FYM method,
argue that LOD will solve the problems of heterogeneous metadata in the museum sector. Rather
than focusing on standardizing metadata schemata, they argue that the semantic structure of the
Resource Description Framework (RDF), a metadata format for linked data, is particularly suited
to reconciling local vocabularies with standard vocabularies, which are available as linked data:
The term 'Linked Data' is referenced as a set of best practices to publish and
connect entities (rather than only documents). The ambition is to create a global
data space of networked resources that can be queried with generic tools, rather
than putting publishers and agents in charge of understanding custom APIs of so-
called data-silos. (p. 465)
These authors point out that the Library of Congress Subject Headings (LCSH) is available as
LOD, and the Getty's Art and Architecture Thesaurus (AAT) will also soon be available as
LOD. They used data from the Powerhouse Museum's collection to demonstrate that these two
vocabulary standards can be reconciled automatically, using the existing Powerhouse Museum
metadata. This approach, they argue, can provide clear entry points for sharing the museum's data
set in the universe of linked data, thereby increasing its accessibility via search.
Angjeli et al. (2009) performed an experiment similar to the FYM project to align the
vocabularies used in two archival collections of illuminated manuscripts using RDF in order to
provide cross-collection search capability. At the time, that experiment required first translating
the two vocabularies into RDF. Today, four years later, many controlled vocabularies are
available in RDF format as linked data. Angjeli et al. (2009) clearly foresaw the LOD potential
of their experiment: "Another, more recent direction [to solve the problem of heterogeneous
cultural heritage data] is to use the techniques of the semantic web, based on the fact that
controlled vocabularies, as true [Knowledge Organization System] KOSs, can be likened to
ontologies, which stand at the core of the semantic web vision" (p. 26).
There is currently much activity in the area of RDF-formatted LOD for museum
collections. Searching in Google for "RDF and museums" retrieved several white papers, reports
and articles in open-source publications; as well as blog posts and wikis updated with reports in
the past two to three years. In addition to the Museum API Wiki, several consortiums are
dedicated to research and discussion about LOD in the cultural heritage sector, and specifically
include museums in their purview. They host websites dedicated to sharing information and
collaborating across the sector. These include:
LOD-LAM community (Linked Open Data in Libraries Archives and Museums,
lod-lam.net)
OpenGLAM (Galleries, Libraries, Archives, and Museums), an initiative of the
UK's Open Knowledge Foundation (http://openglam.org/)
STITCH (Semantic Interoperability To access Cultural Heritage), a project from the
CATCH (Continuous Access to Cultural Heritage) program of the Netherlands Organization
for Scientific Research (http://www.cs.vu.nl/STITCH/)
This fairly new approach of the FYM method may provide an entry point for many
museums to step into linked data. The Getty Research Institute has reported that its
vocabularies, which include the Art and Architecture Thesaurus (AAT), the Union List of Artist
Names (ULAN), the Thesaurus of Geographic Names (TGN), and the Cultural Objects Name
Authority (CONA) will be made available as linked data using RDF very soon (Harpring, 2012).
In the world of LOD, perhaps the lack of a standard metadata schema for museums is a non-
issue. The Getty vocabularies are an agreed-upon controlled standard in this sector. These, with
the help of LOD, may effectively become the closest thing to an interoperability tool that this
sector has ever had.
Linked data experiments in museums. Only three museums currently provide their
entire collections data as LOD in RDF format. All three are part of the Europeana project: the
National Gallery, London (Museum API Wiki, 2013); the British Museum, London (Museum
API Wiki, 2013); and the Rijksmuseum in the Netherlands (STITCH, n.d.). Europeana does not
currently use LOD, but began experimenting with LOD in 2011 (Isaac, 2012)1 and developed its
own data model based on RDF for this purpose (Isaac, Clayphan, & Haslhofer, 2012). In
addition to the Europeana project's experiments with RDF, Allinson (2012) discusses an
experiment by the Tate Museum, London as part of the OpenART project in 2011. The museum
provided data about a sub-set of its collection related to the history of the London art world
between 1660 and 1735 with the aim of enhancing a collaborative website about the London art
world with semantic data (http://artworld.york.ac.uk/). Henry and Brown (2012) also discuss
experiments at the Missouri History Museum to use RDF to create cross-collections search
within their institution.
The British Museum is a success story on this front. In a blog post summarizing the 2011
JISC conference, Stevenson (2011) reports that the British Museum had a huge problem with
heterogeneous data—seven different datasets developed in different parts of the institution.
These data are now available as LOD, which means that those data are now all interlinked. The
British Museum pointed out that linked data does not solve the problem of inconsistent or
missing data; however, it can help to expose these issues and thus lead to solutions (Stevenson,
2011).
1 The Europeana digital library indicates that it includes data from the Musée du Louvre in Paris, but this author was not able to verify that the Louvre is providing its data as LOD or in RDF format.
In addition to the efforts outlined above, a few museums have also declared their
intention to publish collections as LOD soon. The Museum of the City of New York issued a
press release in June of 2012 announcing the intention to release collections data using LOD (but
not specifying whether RDF would be utilized) (Improving…, 2012). In December 2012, the
University of Southern California announced a partnership with the Smithsonian's American Art
Museum to provide LOD for its collections (USC tech experts to guide Smithsonian…, 2012),
which will use RDF (Goodlander, 2013).
Research Methodology
In their peer-reviewed journal article, van Hooland et al. (2013) documented the process
used to reconcile collections data from the Powerhouse Museum in Sydney, Australia with
controlled subject vocabularies available as LOD—what this study calls the FYM method. In
simple terms, the FYM method involves uploading the collections data into Open Refine
(http://openrefine.org/, a free tool for data transformation, which was created as an open source
project by Google), profiling and cleaning the data, and then using Open Refine's reconciliation
function to map terms in the data to a linked data source. Once reconciliation is complete, Open
Refine can also be used to create a new field in the metadata to hold the links (URIs) to the LOD,
and the data can be exported in multiple formats, including RDF/XML, RDF as Turtle, HTML,
and Microsoft Excel. By using a free tool, publishing the steps of their process, and also making
these directions available online on the Free Your Metadata website
(http://freeyourmetadata.org), the authors aimed to provide the cultural heritage community with
the tools to link their own data to these controlled standards. The hope is that these controlled
vocabularies could become a hub for cross-linking between collections (Free Your Metadata,
2013).
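The final step of the method, storing the LOD links alongside the local terms and exporting them as RDF, can be illustrated with a minimal sketch in Python. This is not the FYM authors' code or Open Refine's exporter. The local namespace, the to_turtle helper, the placeholder LCSH identifiers, and the choice of skos:closeMatch as the linking property are all assumptions made for illustration.

```python
# Minimal sketch: serialize reconciled terms as Turtle.
# Assumptions (not from the FYM paper): the local namespace below, the
# placeholder LCSH identifiers, and skos:closeMatch as the linking property.

BASE = "http://example.org/xmuseum/subject/"  # hypothetical local namespace


def slugify(label):
    """Build a simple URI-safe slug from a term label."""
    return "-".join(label.lower().split())


def to_turtle(matches):
    """Return a Turtle document linking each local term to its matched URI.

    `matches` maps a local term label to the URI of the reconciled concept.
    Labels are assumed to contain no double quotes or backslashes.
    """
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for label, uri in sorted(matches.items()):
        lines.append(f'<{BASE}{slugify(label)}> skos:prefLabel "{label}"@en ;')
        lines.append(f"    skos:closeMatch <{uri}> .")
    return "\n".join(lines)


# Placeholder identifiers only; real LCSH URIs would come from reconciliation.
matches = {
    "Walking": "http://id.loc.gov/authorities/subjects/PLACEHOLDER-1",
    "Love": "http://id.loc.gov/authorities/subjects/PLACEHOLDER-2",
}
print(to_turtle(matches))
```

Using a mapping property rather than replacing the local term keeps the museum's own vocabulary intact while still exposing a link into the LOD cloud.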
Key Questions. The current case study aimed to apply the FYM method to map the X
Museum's collections data to two vocabularies: the Library of Congress Subject Headings
(LCSH), and Iconclass (www.iconclass.org). While van Hooland et al. (2013) provide
quantitative analysis of their results with the Powerhouse Museum collection, we cannot
necessarily extrapolate these results to other museum collections, precisely because every
museum collection and its data are unique. Indeed, the X Museum and Powerhouse collections
are very different—the X Museum has fine art objects from Europe and America, while the
Powerhouse Museum has a diverse historical collection of everyday objects. So, while van
Hooland et al. (2013) mapped almost 90% of the Powerhouse Museum records to LOD using a
combination of the LCSH and Art and Architecture Thesaurus (AAT) vocabularies, and also
concluded that combining the two vocabularies produced best results, it is not clear that the FYM
method would produce the same results in every situation.
Thus, this case study aims to test the results of van Hooland et al. with a very different
set of data. Additionally, the X Museum is interested in the results of this case study, as the
information will aid in decisions about whether to proceed with LOD reconciliation as an
approach for adjusting its existing local vocabulary.
The case study began with the following key questions:
Question 1: What metadata elements from the X Museum collections are
best used for the reconciliation process?
Question 2: Will the LCSH or Iconclass vocabulary produce better results?
Or would a combination of both vocabularies produce the best results?
Question 3: What percentage of records can be mapped to the LOD cloud?
Question 4: How much data manipulation and manual labor is required for
the reconciliation process?
Data Sets and Vocabulary Chosen. In their FYM process, van Hooland et al. (2013)
used the collection dataset of 75,823 objects made available online by the Powerhouse
Museum, and analyzed terms populated in the 'Categories' metadata element. They mapped these
terms to LCSH and AAT vocabularies. The Categories element is populated by the Powerhouse
Object Names Thesaurus (PONT), a vocabulary of 7,595 terms. Van Hooland et al. (2013)
chose the LCSH vocabulary because it is available in RDF format as a linked data source, it is
the largest controlled vocabulary available in English, and it is the most widely adopted subject
index used worldwide (p. 468). They chose AAT because it is "the 'most widely known specialist
thesaurus' (Broughton, 2006, p. 41), developed for the cultural heritage domain with a specific
focus on art, architecture, and material culture" (van Hooland et al., 2013, p. 468).
The X Museum case study has adapted the FYM method to the peculiarities of its set of
data. Instead of using data from the entire collection, this study focused on the terms in the X
Museum's local subject vocabulary. Although a portion of records in the museum’s collection
have not been assigned subject terms at all2, where they do exist, these terms are accurately
mapped to the X Museum's local vocabulary. By mapping the local vocabulary, the hope is to
effectively create an association for all of the records that have subject terms assigned. Thus,
rather than manipulating hundreds of thousands of item records in Open Refine, the work is
limited to the 2,337 preferred terms in the X Museum's subject vocabulary.
2 Large sub-sets of the X Museum collection records do not have any subject terms assigned. The proposed approach for these records is to profile other metadata elements in the records for terms that could be mapped to LCSH and Iconclass. These elements include Title, Description, and Culture; and for the Photographs collection, Inscription. The Title element would be the primary approach, because the title field is a required element, and the titles of works of art often describe their subjects (although "Untitled" is used for a large sub-set of objects with no title). The Description element, while not a required element, tends to include terms describing the object itself, iconography, and scenes depicted, all of which may give clues to the subject. The Culture element provides geographical information that may help with subject assignments. Since photographs often include inscriptions describing the scenes depicted, the inscription field may also include terms that could map to vocabularies for this sub-set. For this case study, X Museum's photographs collection records were given a preliminary examination in order to make future recommendations about the feasibility of this approach for the sub-set of records without subject terms assigned. A full analysis of these data and reconciliation of the records to LOD vocabularies is outside the scope of this case study.
This case study chose to reconcile the X Museum's terms with LCSH for the same
reasons it was chosen for the FYM method, as described above. LCSH also offers a means for
the X Museum to align its collection with a broadly defined vocabulary that is not domain-
specific, offering opportunity for opening the collection to non-subject experts. In addition to
LCSH, this study chose Iconclass as a second vocabulary, instead of AAT. This is because many
terms in the X Museum's local vocabulary describe iconography and compositions of the works
of art in the collection. These types of terms do not exist in the LCSH, but are heavily
represented in Iconclass for European subjects. For example, common compositions in European
art from the Christian religion (e.g. pieta, Mary as Shepherdess) and Greek Mythology (e.g.
putti, Marriage of Telemachus and Circe) are represented in Iconclass. Mapping to AAT is not
necessary in this case, as much of the X Museum's collection is already mapped to AAT
(although not in LOD format), and a separate project to add X Museum collections records to the
Getty's new CONA vocabulary is underway, and will provide LOD connections to all of the
Getty vocabularies.
Unfortunately, the Iconclass vocabulary was not successfully reconciled with the X
Museum vocabulary for this project. Iconclass does not provide a machine-readable protocol
service for its data, which is the most direct way for Open Refine to access LOD. Attempts to
upload the Iconclass data file directly into Open Refine, using the RDF/SKOS metadata format,
returned a fatal error. Although reconciliation to Iconclass was not possible, the process of
matching the X Museum subject terms to the LCSH vocabulary did reveal clear areas where
Iconclass could provide matches to the X Museum vocabulary that were left unmatched to
LCSH.
In summary, this case study adapts the FYM method of vocabulary reconciliation to the
specific situation of the X Museum collection data. Instead of utilizing metadata for individual
objects in the collection, it focused on mapping the X Museum local subject vocabulary to the
LCSH controlled vocabulary. This study also did not reconcile the X Museum data to a second
vocabulary, nor did it attempt to assess whether objects not assigned subject terms at all would
be served by the existing local vocabulary.
First Step—Profiling and Cleansing Metadata
The first step of the FYM method is to profile and clean the data. Very little cleaning was
necessary for the X Museum vocabulary, probably because it is a controlled,
intentionally created list. There were no duplicate fields, each term field contained only one
term, and no typos were found. There was also no need to normalize terms with different
spellings and formats, because each term exists only once (except for terms repeated in
different contexts, which should not be merged anyway). This process revealed one of the advantages of
working with a controlled list of terms as opposed to terms assigned individually over time in a
large dataset.
Profiling the data involved using Open Refine to become familiar with the data.
Clustering, sorting, and faceting tools help to analyze how the data is structured, to understand the
depth and breadth of the terms, to see where clusters of concepts appear, and to understand
the distribution of use for various types of terms. One disadvantage of working with the subject
vocabulary instead of the actual collections data is that an analysis of the distribution of terms is
not possible. Through the FYM method, van Hooland et al. (2013) were able to visualize the
skewed distribution of the Powerhouse Museum terms across the entire collection. Such
information can become useful during reconciliation, for example, by informing choices about
merging concepts.
The 3,839 terms in the X Museum subject vocabulary are categorized into four types:
Term (2,204), Synonym (32), Related (101), and Alternate (1,502). After verifying with the
collections manager that “Alternate” terms are simply alternate spellings of Terms, made
available for search optimization, and not used in the collections records, these were removed
from the data, resulting in 2,337 terms. Synonyms and Related terms were kept, as it is unclear
how these may differ from Terms, or how they may be assigned to the collections records.
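The filtering step described above amounts to dropping one term type and summing what remains. The following Python sketch reproduces the arithmetic from the reported counts. The variable and function names are hypothetical, and the real work was done with Open Refine facets rather than code.

```python
# Term counts by type, as reported for the X Museum subject vocabulary.
type_counts = {"Term": 2204, "Synonym": 32, "Related": 101, "Alternate": 1502}

# Types not used in the collections records, per the collections manager.
EXCLUDED_TYPES = {"Alternate"}


def kept_total(counts, excluded):
    """Sum the counts of every term type that is not excluded."""
    return sum(n for term_type, n in counts.items() if term_type not in excluded)


print(kept_total(type_counts, EXCLUDED_TYPES))  # 2337
```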
The X Museum vocabulary is organized in a hierarchy of five levels. Faceting the terms
in Open Refine by level reveals that Level 3 contains the bulk of all terms:
Table 1 Number of terms in X Museum subject vocabulary by level in hierarchy
Level # of Terms
Level 1 7
Level 2 56
Level 3 1,463
Level 4 588
Level 5 223
Terms become more specific in each level of the hierarchy, with Level 1 being the most general,
and Level 5 the most specific. The seven Level 1 terms do not describe single concepts or
subjects, but describe general topics, such as the natural world, or buildings and architecture.
Level 4 and 5 terms, on the other hand, are extremely specific. All but one term in Level 5 are
names. Names also comprise 38% of Level 4 terms, which also include specific terms like
varieties of flowers, types of birds, and names of mythical beasts in Greek mythology. While
Level 1 and Level 5 were unlikely to produce matches with the LCSH vocabulary, these terms
were kept in the data set; they can easily be removed from totals at the end, and any terms that do
match could prove useful. Based on the distribution in this hierarchy, one might expect that
terms in Level 3 would produce the most matches with LCSH. However, many names also
appear in Level 3, and some general terms that may be useful for searching exist in Level 4.
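As a rough illustration of what faceting by level shows, the sketch below computes each level's share of the vocabulary from the counts in Table 1. It is a stand-in for the Open Refine facet, not a description of how the tool works internally, and the variable names are hypothetical.

```python
# Number of terms per hierarchy level, taken from Table 1.
level_counts = {1: 7, 2: 56, 3: 1463, 4: 588, 5: 223}

total = sum(level_counts.values())

# Print each level's share of the whole vocabulary.
for level, n in level_counts.items():
    print(f"Level {level}: {n:>5} terms ({n / total:.1%} of vocabulary)")

# The level holding the most terms.
largest_level = max(level_counts, key=level_counts.get)
print("Largest level:", largest_level)
```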
It was desirable to remove the proper names from the data set because names are not
subjects—the LCSH is a subject vocabulary, not a name authority. However, because there is no
way for Open Refine to know which terms are proper names so they can be filtered out, these
were included in the reconciliation. During the reconciliation process proper names were flagged
manually so that they could be easily removed from totals. As we will see below, quite a few of
these proper nouns did have corresponding subject terms in LCSH.
Second Step—Reconciliation
While Open Refine can create automatic matches to LOD sources, in reality the
reconciliation process comprises several steps, most of which require manual work. After
loading LCSH data into Open Refine as a reconciliation service via its SPARQL endpoint3, the
program can be tasked to reconcile data in a single data column. The automated reconciliation
produces one of three possibilities for each term, as seen in the screen capture below, in the
"Term_LCSH" column: a match (term becomes a blue link, e.g. Walking), a list of suggested
matches (light blue links appear under the term, e.g. Love), or no match (term is in black with
option to create new topic below, e.g. Annals).
3 The Wikipedia (2013) definition is "SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework format" (1st sentence).
Figure 1 Screen capture from Google Refine, showing three types of matches in the "Term_LCSH" column after automated reconciliation.
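The three outcomes of automated reconciliation (a match, a list of suggestions, or no match) can be sketched in Python. The sketch does not reproduce Open Refine's scoring. It assumes a simple rule: an exact case-insensitive label match counts as a match, and string similarity above an arbitrary threshold produces suggestions. The vocabulary labels are illustrative and have not been verified against LCSH.

```python
from difflib import SequenceMatcher


def classify(term, vocabulary, suggest_threshold=0.5):
    """Classify a local term against a list of vocabulary labels.

    Returns ("match", label), ("suggestions", [labels]) or ("none", []).
    The exact-match rule and the threshold are assumptions for illustration.
    """
    lowered = term.lower()
    exact = [label for label in vocabulary if label.lower() == lowered]
    if len(exact) == 1:
        return "match", exact[0]

    # Score every label by string similarity and keep the close ones.
    scored = [
        (SequenceMatcher(None, lowered, label.lower()).ratio(), label)
        for label in vocabulary
    ]
    suggestions = [
        label for score, label in sorted(scored, reverse=True)
        if score >= suggest_threshold
    ]
    if suggestions:
        return "suggestions", suggestions
    return "none", []


# Illustrative labels only, not checked against the LCSH data.
vocabulary = ["Walking", "Love in art", "Love poetry", "Portraits"]

for term in ["Walking", "Love", "Annals"]:
    print(term, "->", classify(term, vocabulary))
```

In this toy run the three terms from the screen capture fall into the three categories in the same order. With real data, every suggestion and every match still needs the manual review described in the text.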
The initial, automated reconciliation matched 10% of the X Museum subject terms to
LCSH. In their initial reconciliation of the Powerhouse Museum collection with LCSH, van
Hooland et al. (2013) matched 12.2% of single terms (p. 471). After examining the way that the
LCSH terms were assigned by Open Refine, van Hooland et al. (2013) were able to
"preprocess" the LCSH linked data in order to greatly increase the number of matches on the
Powerhouse Museum terms to 47.2% (p. 472). On the Free Your Metadata website, the authors
provide access to their "preprocessed" LCSH data and suggest that this will also increase the
matching rate for other institutions. Indeed, when this study applied the "preprocessed" LCSH
data to the X Museum vocabulary terms, the percentage of matched terms doubled to 20%.
Unfortunately, upon closer examination, it appears that many of the LCSH terms were
mismatched to the X Museum terms. A spot check revealed many terms assigned were either too
general, or on the wrong topic entirely. This issue may be explained by the fact that the
"preprocessing" of LCSH terms was designed by van Hooland et al. (2013) to address specific
issues in the Powerhouse Museum vocabulary, including the use of term qualifiers and complex,
multi-term concepts, as well as an inconsistent use of singular and plural terms. Since the X
Museum vocabulary is structured differently from the Powerhouse Museum vocabulary, the
preprocessing may have over-simplified the LCSH vocabulary for the current case study.
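To illustrate how label preprocessing of this kind can raise match counts while also producing mismatches, the following Python sketch strips parenthetical qualifiers and applies a naive singular form. It is not the preprocessing used by van Hooland et al. The rules, function names, and example labels are assumptions chosen only to show the trade-off.

```python
import re


def naive_singular(word):
    """Very rough singularization; deliberately simplistic."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(label):
    """Drop parenthetical qualifiers, lowercase, and singularize each word."""
    label = re.sub(r"\s*\([^)]*\)", "", label)  # "Bridges (Architecture)" -> "Bridges"
    words = label.lower().strip().split()
    return " ".join(naive_singular(w) for w in words)


# Hypothetical labels: the first pair shows a helpful merge, the second an
# over-simplification in which two distinct concepts collapse into one key.
print(normalize("Bridges (Architecture)"), "|", normalize("Bridge"))
print(normalize("Arms"), "|", normalize("Arm"))
```

The second pair shows how a rule that helps one vocabulary can assign terms to the wrong topic in another, which is consistent with the mismatches found in the spot check.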
After reconciliation, the matched terms must then be checked manually for mismatches.
Open Refine's suggested matches help speed up this matching process. Many of these
suggestions simply require ticking off the box next to the correct term to add the link. But many
of these suggestions are not appropriate, and these terms, and the unmatched terms, require
manual lookup in the LCSH vocabulary to locate the appropriate term. Open Refine provides a
convenient search interface to the LCSH vocabulary (see Figure 2), but some terms simply need
to be researched in the LCSH vocabulary provided by the Library of Congress's linked data
service (http://id.loc.gov/).
Figure 2 A tool within Open Refine offers a search interface to the linked vocabulary.
Manual Matching
Based on the organization of the X Museum vocabulary into hierarchical levels, and the
desire to remove the personal names from the final count, the method for manual reconciliation
of the remaining terms proceeded as follows:
1. Work with one level at a time, starting with Level 5
2. Match proper names to suggested subject terms (see Appendix A for rules guiding this
process)
3. Flag proper names so they can be removed from totals
4. Manually match remaining terms quickly (using a maximum of three searches on the LCSH
website) and note how much time this took
5. Take note of sub-sets of X terms that present problems matching to LCSH
6. Create totals for each level (Table 2)
Table 2 X Museum subject terms matched to LCSH by term hierarchy level, with adjustments after removing proper names
                                              Level 1  Level 2  Level 3  Level 4  Level 5   Totals
# Terms in Museum vocabulary                        7       56    1,463      588      223    2,337
# Terms matched automatically                       1        1      159       70        0      231
% Terms matched automatically                     14%       2%      11%      12%       0%      10%
# Terms matched after manual reconciliation         1       47      779      237       38    1,102
% Terms matched after manual reconciliation       14%      84%      53%      40%      21%      47%
Proper Names
# Proper names                                      0        0      557      222      222    1,001
# Proper names matched                              0        0       88       15       36      139
% Terms matched after removing proper names       14%      84%      76%      61%     100%      72%
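The adjusted percentages for Level 3, Level 4, and the Totals column can be reproduced with the short sketch below. It assumes that proper names are subtracted from both the number of terms and the number of matched terms; the function and variable names are illustrative, and the counts are transcribed from Table 2.

```python
# Counts transcribed from Table 2 (Level 3, Level 4, and the Totals column).
TABLE_2 = {
    "Level 3": {"terms": 1463, "matched": 779, "names": 557, "names_matched": 88},
    "Level 4": {"terms": 588, "matched": 237, "names": 222, "names_matched": 15},
    "Totals": {"terms": 2337, "matched": 1102, "names": 1001, "names_matched": 139},
}


def adjusted_match_rate(row):
    """Match rate after dropping proper names from both counts (assumed formula)."""
    remaining_terms = row["terms"] - row["names"]
    remaining_matched = row["matched"] - row["names_matched"]
    return remaining_matched / remaining_terms


for level, row in TABLE_2.items():
    print(f"{level}: {adjusted_match_rate(row):.0%}")
```

Under this assumption the Totals column works out to 963 matched terms out of 1,336 remaining terms, which rounds to the 72% reported.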
Analysis
As seen from the results presented in Table 2, removing the proper names from the X
Museum vocabulary significantly increases the matches to the LCSH vocabulary. And, as
suspected, Levels 1 and 5 contain terms that are not compatible with the subject headings in
LCSH. Focusing on Levels 2-4, we see that Level 3 is not necessarily the level with the highest
percentage of terms matched. Although there are very few terms in Level 2, and few of these
matched after the automatic reconciliation, manual reconciliation raised the number of matches
significantly. Comparing these numbers with the FYM method, we see that 50% of the
Powerhouse Museum terms were matched to LCSH after automatic reconciliation, using the
full Powerhouse Museum collection data with the "preprocessed" LCSH terms. In this case
study, a combination of automatic and manual reconciliation of the X Museum vocabulary
resulted in 72% of terms matched to LCSH.
While manually matching the terms, several issues with the X Museum vocabulary
became obvious. First, not only does the X vocabulary contain many proper names, it also
includes many literary titles, place names, and mythological names. All of these terms were
matched to LCSH if a subject heading was easily located—no more than three searches were
made on the LCSH website. However, many of these names did not have a corresponding term
in LCSH. For example, the X Museum vocabulary contains the names of many churches, as well
as the names of many minor gods and creatures in Greek mythology. Mapping these names to
broader terms in the LCSH (e.g. City Churches; Rural Churches; Gods, Greek) would increase
the matched terms, and also may provide better findability for non-subject-experts. Similarly, the
many specific names for types of flowers, insects, and birds in the X Museum vocabulary could be
mapped to broader terms in LCSH to improve findability.
Secondly, Greek names created problems for automatic matching, requiring added time
to perform manual matching. Many Greek mythology names in the X Museum vocabulary are
also used as biological names of plants and animals. This caused many mismatches with LCSH
that had to be rectified. For example, Hydra matched automatically to the LCSH term for the sea
creature. The Greek names also have many alternate spellings, requiring troubleshooting when
searching for the term in LCSH. Because this case study made no more than three attempts
to match any one term to LCSH, many more of these names may be matchable in LCSH with
more research.
Thirdly, as suspected, the X Museum vocabulary does include many terms that describe
common iconography and scenes in works of art (e.g. Adoration of the Shepherds). With the
exception of some terms describing Biblical scenes (probably because they also exist in
literature), most of these terms were not available in LCSH. This is where mapping to the
Iconclass vocabulary would likely prove very useful, and fill in many of the gaps left by LCSH.
Conclusion and Recommendations
This case study began with four key questions, which were answered as follows.
Question 1: What metadata elements from the X Museum collections are best used for
the reconciliation process? Because the X Museum collection subject vocabulary is closely
controlled and mapped consistently to the collection records, the reconciliation was performed
on the Term element field within the vocabulary itself. Because so many proper names
are included in this vocabulary, the overall matching rate is lowered. However, many of the
proper names in the X Museum vocabulary do match to LCSH historical subjects, and thus may
prove useful for search. The recommendation is to keep these proper names and their links to
the LCSH LOD vocabulary in the data set. With further research, the remaining unmatched
proper names could be used to identify appropriate subject terms in LCSH (e.g. Alan Pinkerton
could be mapped to the LCSH term Spies).
Question 2: Will the LCSH or Iconclass vocabulary produce better results? Or would a
combination of both vocabularies produce best results? Because the case study was not able to
achieve reconciliation with the Iconclass vocabulary, this question cannot be answered.
However, it is clear that the Iconclass vocabulary could be used to supplement the LCSH LOD
for those terms in the X Museum vocabulary that describe iconography and common
compositions in works of art. If the technical issues with the Iconclass RDF files can be resolved,
it is recommended to reconcile the X Museum vocabulary with Iconclass.
Question 3: What percentage of records can be mapped to the LOD cloud? Taking
proper names out of the mix, 72% of the X Museum collection vocabulary terms were matched
to LCSH. Determining how many collection records this mapping would affect would require
calculating how many records are assigned terms not matched to LCSH at all. The current study
does not have access to that information. Since the unmatched terms tend to be highly specific, it
is possible that those terms are not the only term assigned to any single object record. Therefore,
it's likely that the 72% of terms that matched to LCSH are used to describe a majority of objects
in the X Museum collection (not including those records with no subject terms assigned).
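If record-level data were available, the share of records affected could be estimated along the following lines. This is a minimal sketch under stated assumptions: the record identifiers, the assigned terms, and the set of matched terms are all hypothetical, and records with no subject terms are excluded from the denominator, as in the discussion above.

```python
def record_coverage(records, matched_terms):
    """Share of records with subject terms that carry at least one matched term."""
    # Ignore records that have no subject terms assigned.
    with_terms = [terms for terms in records.values() if terms]
    if not with_terms:
        return 0.0
    covered = sum(1 for terms in with_terms if terms & matched_terms)
    return covered / len(with_terms)


# Hypothetical records: object ID -> set of assigned subject terms.
records = {
    "obj-1": {"Churches", "Saint Agnes"},
    "obj-2": {"Adoration of the Shepherds"},
    "obj-3": {"Gods, Greek"},
    "obj-4": set(),  # no subject terms; excluded from the denominator
}
# Hypothetical set of vocabulary terms that were matched to LCSH.
matched_terms = {"Churches", "Gods, Greek"}

print(f"{record_coverage(records, matched_terms):.0%} of records have a matched term")
```

A record counts as covered as soon as one of its terms has a match, which is why highly specific unmatched terms may lower the term-level rate more than the record-level rate.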
Question 4: How much data manipulation and manual labor is required for the
reconciliation process? Most of the reconciliation of the X Museum collection subject
vocabulary to LCSH was completed manually. Only 10% of the terms were matched after the
automatic reconciliation process. The automatic process was very quick: after loading the
vocabulary into Open Refine, the reconciliation took about 30 minutes to complete on a standard
residential cable Internet connection. If the proper names are removed from the data set, the
automatic match rate increases to 17% of terms. This is consistent with the findings of van Hooland
et al. (2013), who matched 12% of the Powerhouse Museum collection records with automatic
reconciliation using the LCSH LOD as it is provided by the Library of Congress (p. 471). The
"preprocessing" of the LCSH vocabulary data that these authors performed in order to increase
automatic matching seems to have been tailored specifically enough to the Powerhouse Museum
collection as to not be useful in this case study. The authors do not indicate how much time the
"preprocessing" took, but this can be seen essentially as a form of manual approach to the
reconciliation.
After profiling the data and performing the automatic reconciliation, the manual process
in this case study took about 10 hours for one person to complete. The Open Refine tool certainly
aided the manual reconciliation process by providing a search tool within its interface; creating
automatic links between the data in the system and the LCSH LOD; and providing tools for
sorting, creating facets, flagging, and editing records in bulk. Manual reconciliation of the X
Museum subject vocabulary involved comparing suggested terms to the X Museum collection
context and making decisions about whether a match was topical, and looking up terms in the
LCSH to find possible matches. This case study limited such searches to no more than three;
further research could suggest additional matches that would increase the percentage of
vocabulary terms matched to the LCSH vocabulary.
Appendix A--Rules for matching proper names in the X Museum vocabulary to LCSH subject headings
1. MATCHED if the name exists in LCSH as a subject heading
   a) IF the name is used as a general marker for a historical period and
   b) IF there is 100% certainty about the intended person matching the LCSH term
   e.g. Louis XIV
   LCSH options:
      Paris (France)--History--Louis XIV, 1643-1715
      Louis XIV, King of France, 1638-1715--Portraits
      Decoration and ornament--Louis XIV style
      France--History--Louis XIV, 1643-1715
   Choose France--History--Louis XIV, 1643-1715
2. NOT matched if < 100% certainty about the person matching the term, or disambiguation among several options requires research.
   e.g. Frederick I
   LCSH options:
      Norway--History--Frederick I, 1523-1533
      Denmark--History--Frederick I, 1523-1533
      Germany--History--Frederick I, 1152-1190
3. NOT matched if the name is used in LCSH only as qualifier for a more specific concept
   e.g. King David
   LCSH options (David, King of Israel not available):
      David, King of Israel -- in art
      David, King of Israel -- in literature
      David, King of Israel -- in rabbinical literature
4. MATCHED names to a broader term if it is narrower than the next level up in the museum’s vocabulary hierarchy
   e.g. All saints' names matched to Christian Saints
      Saint Agnes matched to Christian Saints
      Saint Andrew matched to Christian Saints
5. MATCHED mythological names, literary characters, legendary characters if available in LCSH.
References
Allinson, J. (2012). OpenART: Open Metadata for Art Research at the Tate. Bulletin of the
American Society for Information Science and Technology, 38(3).
Baca, M. (2007). CCO and CDWA Lite: Complementary Data Content and Data Format
Standards for Art and Material Culture Information. Visual Resources Association
Bulletin, 34(1), 69-75.
Barker, G. (2012, September 1). More on museum datasets, un-comprehensive-ness, data
mining. (Web log comment). Retrieved from
http://www.freshandnew.org/2012/08/museum-datasets-un-comprehensive-ness-data-mining/
Berners-Lee, T. (2006, July 27). Linked Data - Design Issues. Retrieved from
http://www.w3.org/DesignIssues/LinkedData
Berners-Lee, T. (2009). Tim Berners-Lee on the Next Web. (Video file). Retrieved from
http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
Chan, S. (2012, August 23). More on museum datasets, un-comprehensive-ness, data mining.
(Web log). Retrieved from
http://www.freshandnew.org/2012/08/museum-datasets-un-comprehensive-ness-data-mining/
Coburn, E., Lanzi, E., O'Keefe, E., Stein, R., & Whiteside, A. (2010). The Cataloging Cultural
Objects experience: Codifying practice for the cultural heritage community. IFLA
Journal, 36(1), 16-29. doi:10.1177/0340035209359561
Elings, M. W., & Waibel, G. (2007). Metadata for All: Descriptive Standards and Metadata
Sharing across Cultural Heritage Communities. Visual Resources Association Bulletin,
34(1), 7-14.
Free Your Metadata. (2013). freeyourmetadata.org
Goodlander, G. (2013, March 21). OpenGLAM: LOD and American Art. (Slide presentation).
http://www.slideshare.net/georginab/openglam-lod-and-american-art
Henry, D. & Brown, E. (2012). Using an RDF Data Pipeline to Implement Cross-Collection
Search. In D. Bearman & J. Trant (eds.) Museums and the Web 2012: Proceedings. San
Diego: Archives & Museum Informatics, 2012. Retrieved from
http://www.museumsandtheweb.com/mw2012/papers/using_an_rdf_data_pipeline_to_implement_cross_
Isaac, A. (2012, March 26). Europeana and Linked Open Data. (Web log). Retrieved from
http://blog.okfn.org/2012/03/26/europeana-and-linked-open-data/
Isaac, A., Clayphan, R., & Haslhofer, B. (2012). EUROPEANA: Moving to Linked Open Data.
Information Standards Quarterly, 24(2/3), 34-40. doi:10.3789/isqv24n2-3.2012.06
Improving Digital Record Annotation Capabilities with Open-sourced Ontologies and Crowd-
sourced Workers. (2012, June 5). (Press release). Retrieved from LexisNexis.
Kessler, B. (2007). Encoding Works and Images: The Story Behind VRA Core 4.0. Visual
Resources Association Bulletin, 34(1), 20-33.
Koolen, M., Kamps, J., & De Keijzer, V. (2009). Information Retrieval in Cultural Heritage.
Interdisciplinary Science Reviews, 34(2/3), 268-284. doi:10.1179/174327909X441153.
Library of Congress. (n.d.). LC Linked Data Service: Authorities and Vocabularies.
http://id.loc.gov/
Linked Open Data in Libraries, Archives and Museums. (2013). lod-lam.net
Museums and the Machine-Processable Web wiki. (2013). http://museum-api.pbworks.com/
Museum APIs (2013). Retrieved from
http://museum-api.pbworks.com/w/page/21933420/Museum%C2%A0APIs
OpenGLAM. (2013). http://openglam.org/
OpenRefine (n.d.). http://openrefine.org/
Purday, J. (2009). Think culture: Europeana.eu from concept to construction. The Electronic
Library, 27(6), 919-937.
RDF Refine. (n.d.). – A Google Refine extension for exporting RDF. http://refine.deri.ie/
STITCH. Semantic Interoperability To access Cultural Heritage. http://www.cs.vu.nl/STITCH/
USC tech experts to guide Smithsonian museum to next generation of the Internet (2012,
December 8). (Press release). Retrieved from LexisNexis.
van Hooland, S., De Wilde, M., Verborgh, R., Mannens, E., Van de Walle, R., & Hercher, J.
(2013). Evaluating the success of vocabulary reconciliation for cultural heritage
collections. Journal of the American Society for Information Science and Technology, 64(3), 464-479.
doi:10.1002/asi.22763
Waibel, G. (2010). Museum Data Exchange: Learning How to Share. D-Lib Magazine,
16(3/4). doi:10.1045/march2010-waibel
Wikipedia. (2013). SPARQL. http://en.wikipedia.org/wiki/SPARQL