
Running Head: MUSEUM COLLECTIONS METADATA AND LINKED DATA

A Method for Aligning Museum Collections Metadata with Standards, Using Linked Data

Case Study: The [name withheld] Museum Collection Subject Vocabulary

Susan Edwards

May 13, 2013

SJSU – School of Library and Information Science

LIBR 281 - Metadata


Museums have been working together for several decades to find a way to share and

aggregate their collections data (Waibel, 2010). A key barrier to success has been the unique,

idiosyncratic nature of most museum collections metadata schemata. Although metadata

standards for describing cultural heritage objects and works of art do exist, museums have not

accepted a single standard that would facilitate universal interoperability (Coburn et al., 2010;

Koolen, Kamps & De Keijzer, 2009; Waibel, 2010). In addition, museums tend to develop local

vocabularies tailored to the unique nature of a specific collection. Thus, sharing data and

searching across collections is a challenge not only because the structure of the data differs from

institution to institution, but because the very language used to describe their collections also

differs. Linked open data (LOD) may offer a solution to the problem of museum metadata

aggregation and sharing, precisely because LOD can facilitate interoperability without requiring

standard schemata and vocabularies.

The case study described in this paper applies and evaluates one technique for using LOD

to increase the interoperability of museum collections data. Developed by the Free Your

Metadata (FYM) project (http://freeyourmetadata.org/about/) (van Hooland et al., 2013), the

method involves mapping a local vocabulary to standard controlled vocabularies using linked

data. The authors aimed to demonstrate how reconciling local vocabularies to the linked data

cloud can be used to "derive additional value" from existing metadata (van Hooland et al.,

2013, p. 464). This case study applies the FYM method to the X Museum collection (name

withheld; X is a pseudonym). The museum is in the process of updating the presentation of their

collections online and has found that the local vocabulary needs to be updated. The museum is

interested in mapping the local vocabulary to a standard, which would provide more flexibility

for future cataloging and data interoperability, and would also look towards a future when the data

is available as LOD, or integrated as part of a cross-museum-collections search. By mapping to

standards that are available via LOD, and by adding the relevant LOD metadata to their own

metadata in the process, the museum could take one giant leap towards future interoperability

with other museum collections.

This paper will provide a review of the literature about issues with museum metadata

interoperability, and about the emerging practice of museums using LOD. It will then provide a

report on the application of the FYM methodology to the X Museum collection data, with an

analysis of the results, and recommendations for future work.

Literature Review

The problem of heterogeneous metadata. The sharing of collections data is not new to

museums. Waibel (2010) points out that the desire for museum metadata interoperability

dates back to 1969 and the establishment of the Museum Computer Network (MCN). A wiki

created and maintained by museum technology professionals and dedicated to discussion and

information sharing about museum collection data, "Museums and the machine-processable

web" (Museum API Wiki) http://museum-api.pbworks.com/, includes a long list of institutions,

including many museums, that make collections data available upon request (Museum APIs,

2013). But the form in which the data is provided, the methods of delivery, and the schemas

utilized, vary widely. Most museums have created their own unique APIs to provide access, or

use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to provide raw

data. The plethora of proprietary APIs for accessing museum collections data points to a

longstanding issue in this sector: museums simply don't have standard, shared metadata schemas,

vocabularies, or description frameworks, for their collections.



Chan (2012) and Elings and Waibel (2007) contrast the nature of museum collections

with those of archives and libraries in order to explain this situation. Libraries and archives,

which have been at the forefront of metadata standards development, have standard methods of

describing their collections via schemas, cataloging rules, and vocabularies, such as AACR,

LCSH, MARC, Dublin Core, and DACS, which every institution uses. But because museums

have very diverse collections of unique objects, they have not been set up in the same way as

libraries, and have often described their collections in an idiosyncratic, inconsistent manner.

Many have developed their own internal description standards and vocabularies, and have even

applied those standards inconsistently over time (Waibel, 2010). Barker (2012) argues that the

fundamental philosophy of museums is different from libraries: libraries have focused on access

to resources and do little interpretation, while museums tend to focus on collecting and providing

interpretation about their objects. White (2012) echoes this sentiment, pointing out that the

varied nature of museum data is a manifestation of every museum's aspiration to be seen as

distinct through its own collection. Thus, museum metadata is so heterogeneous and

idiosyncratic from institution to institution that a major interoperability problem is presented

when institutions begin talking about sharing data (Chan, 2012; Henry and Brown, 2012; Isaac,

Clayphan, & Haslhofer, 2012; Koolen, Kamps, & De Keijzer, 2009; Waibel, 2010).

In the past eight years, much effort has been devoted to the various metadata schemata

for cultural heritage and visual culture objects, with the aim of providing a single standard

vehicle for data aggregation and sharing. In 2005, Scruton pointed to the findings of the FAIR

(Focus on Access to Institutional Resources) program in the UK, a project to experiment with the

viability of metadata harvesting, which found that Dublin Core was an inadequate metadata

schema for museum collections. In 2007 an issue of the Visual Resources Association Bulletin


was dedicated to exploring the sticky issues around metadata schemas for visual culture,

rehashing the problem of heterogeneous data, analyzing the appropriateness of the various

schemata for data exchange, and arguing the benefits of standards for interoperability. Several

authors analyzed the applicability of various metadata schemata for museums—from Categories

for the Description of Works of Art (CDWA), to Cataloguing Cultural Objects (CCO), to the

Visual Resources Association Core metadata (VRA Core) (Baca, 2007; Elings and Waibel, 2007;

Kessler, 2007). Kessler (2007) acknowledged that, "a truly viable mechanism for shared

cataloging continues to elude the visual resources community" (p. 20).

Deciding on a standard metadata schema, difficult as that is, does not seem to be the only

issue. The real problem is the data itself, as Chan (2012) states bluntly, "… the truth is that our

[the museum sector's] data sucks." For example, Ridge (2012), in her examination of the data

from a single collection, discovered that the application of inconsistent metadata standards over

time was the main barrier to working with the data. Waibel (2010) discusses the Museum Data

Exchange project, an OCLC funded project started in 2007 to lower the barrier to metadata

exchange in museums. The project quickly ran up against issues with the quality of the data itself

—inconsistently applied terms, a lack of standard vocabularies, and missing information. Isaac,

Clayphan, and Haslhofer (2012) also describe similar issues found with museum data through the

European Union's Europeana project (http://www.europeana.eu/), a digital library of European

cultural heritage that makes digitized collections from more than 2,000 institutions across Europe

—primarily libraries and archives—available through one portal (Purday, 2009). With such

inconsistency in the underlying data, it would seem that even if a set of standard metadata

schemata could be agreed to in museums, it would take a large effort for these institutions to

reconcile their data sets to that standard.



Linked data to the rescue. Van Hooland et al. (2013), authors of the FYM method,

argue that LOD will solve the problems of heterogeneous metadata in the museum sector. Rather

than focusing on standardizing metadata schemata, they argue that the semantic structure of the

Resource Description Framework (RDF), a metadata format for linked data, is particularly suited

to reconciling local vocabularies with standard vocabularies, which are available as linked data:

The term 'Linked Data' is referenced as a set of best practices to publish and

connect entities (rather than only documents). The ambition is to create a global

data space of networked resources that can be queried with generic tools, rather

than putting publishers and agents in charge of understanding custom APIs of so-

called data-silos. (p. 465)

These authors point out that the Library of Congress Subject Headings (LCSH) is available as

LOD, and the Getty's Art and Architecture Thesaurus (AAT) vocabulary will also soon be available as

LOD. They used data from the Powerhouse Museum's collection to demonstrate that these two

vocabulary standards can be reconciled automatically, using the existing Powerhouse Museum

metadata. This approach, they argue, can provide clear entry points for sharing the museum's data

set in the universe of linked data, thereby increasing its accessibility via search.

Angjeli et al. (2009) performed an experiment similar to the FYM project to align the

vocabularies used in two archival collections of illuminated manuscripts using RDF in order to

provide cross-collection search capability. At the time, that experiment required first translating

the two vocabularies into RDF. Today, four years later, many controlled vocabularies are

available in RDF format as linked data. Angjeli et al. (2009) clearly foresaw the LOD potential

of their experiment: "Another, more recent direction [to solve the problem of heterogeneous

cultural heritage data] is to use the techniques of the semantic web, based on the fact that


controlled vocabularies, as true [Knowledge Organization System] KOSs, can be likened to

ontologies, which stand at the core of the semantic web vision" (p. 26).

There is currently much activity in the area of RDF-formatted LOD for museum

collections. Searching in Google for "RDF and museums" retrieved several white papers, reports

and articles in open-source publications; as well as blog posts and wikis updated with reports in

the past two to three years. In addition to the Museum API Wiki, several consortiums are

dedicated to research and discussion about LOD in the cultural heritage sector, and specifically

include museums in their purview. They host websites dedicated to sharing information and

collaborating across the sector. These include:

LOD-LAM community (Linked Open Data in Libraries Archives and Museums,

lod-lam.net)

OpenGLAM (Galleries, Libraries, Archives, and Museums), an initiative of the

UK's Open Knowledge Foundation (http://openglam.org/)

STITCH (Semantic Interoperability To access Cultural Heritage), a project from the

CATCH (Continuous Access to Cultural Heritage) program of the Netherlands Organization

for Scientific Research (http://www.cs.vu.nl/STITCH/)

This fairly new approach of the FYM method may provide an entry point for many

museums to step into linked data. The Getty Research Institute has reported that its

vocabularies, which include the Art and Architecture Thesaurus (AAT), the Union List of Artist

Names (ULAN), the Thesaurus of Geographic Names (TGN), and the Cultural Objects Name

Authority (CONA) will be made available as linked data using RDF very soon (Harpring, 2012).

In the world of LOD, perhaps the lack of a standard metadata schema for museums is a non-

issue. The Getty vocabularies are an agreed-upon controlled standard in this sector. These, with



the help of LOD, may effectively become the closest thing to an interoperability tool that this

sector has ever had.

Linked data experiments in museums. Only three museums currently provide their

entire collections data as LOD in RDF format. All three are part of the Europeana project: the

National Gallery, London (Museum API Wiki, 2013); the British Museum, London (Museum

API Wiki, 2013); and the Rijksmuseum in the Netherlands (STITCH, n.d.). Europeana does not

currently use LOD, but began experimenting with LOD in 2011 (Isaac, 2012)1 and developed its

own data model based on RDF for this purpose (Isaac, Clayphan, & Haslhofer, 2012). In

addition to the Europeana project's experiments with RDF, Allinson (2012) discusses an

experiment by the Tate Museum, London as part of the OpenART project in 2011. The museum

provided data about a sub-set of its collection related to the history of the London art world

between 1660 and 1735 with the aim of enhancing a collaborative website about the London art

world with semantic data (http://artworld.york.ac.uk/). Henry and Brown (2012) also discuss

experiments at the Missouri History Museum to use RDF to create cross-collections search

within their institution.

The British Museum is a success story on this front. In a blog post summarizing the 2011

JISC conference, Stevenson (2011) reports that the British Museum had a huge problem with

heterogeneous data—seven different datasets developed in different parts of the institution.

These data are now available as LOD, which means that those data are now all interlinked. The

British Museum pointed out that linked data does not solve the problem of inconsistent or

missing data; however, it can help to expose these issues and thus lead to solutions (Stevenson,

2011).

1 The Europeana digital library indicates that it includes data from the Musée du Louvre in Paris, but this author was not able to verify that the Louvre is providing its data as LOD or in RDF format.



In addition to the efforts outlined above, a few museums have also declared their

intention to publish collections as LOD soon. The Museum of the City of New York released a

press release in June of 2012 announcing the intention to release collections data using LOD (but

not specifying whether RDF would be utilized) (Improving…, 2012). In December 2012, the

University of Southern California announced a partnership with the Smithsonian's American Art

Museum to provide LOD for its collections (USC tech experts to guide Smithsonian…, 2012),

which will use RDF (Goodlander, 2013).

Research Methodology

In their peer-reviewed journal article, van Hooland et al. (2013) documented the process

used to reconcile collections data from the Powerhouse Museum in Sydney, Australia with

controlled subject vocabularies available as LOD—what this study calls the FYM method. In

simple terms, the FYM method involves uploading the collections data into Open Refine

(http://openrefine.org/, a free tool for data transformation, which was created as an open source

project by Google), profiling and cleaning the data, and then using Open Refine's reconciliation

function to map terms in the data to a linked data source. Once reconciliation is complete, Open

Refine can also be used to create a new field in the metadata to hold the links (URIs) to the LOD,

and the data can be exported in multiple formats, including RDF/XML, RDF as Turtle, HTML,

and Microsoft Excel. By using a free tool, publishing the steps of their process, and also making

these directions available online on the Free Your Metadata website

(http://freeyourmetadata.org), the authors aimed to provide the cultural heritage community with

the tools to link their own data to these controlled standards. The hope is that these controlled



vocabularies could become a hub for cross-linking between collections (Free Your Metadata,

2013).
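The overall flow of the method (clean, reconcile, store the URI, export) can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the Open Refine implementation: the vocabulary dictionary, the example.org URIs, and the function names are all hypothetical, and only exact label matches are linked. The export step writes one skos:exactMatch statement per reconciled term in N-Triples syntax.

```python
import re

# Hypothetical stand-in for a controlled vocabulary published as linked data:
# label -> URI. The example.org URIs are placeholders, not real LCSH identifiers.
VOCAB = {
    "walking": "http://example.org/vocab/walking",
    "love": "http://example.org/vocab/love",
}

def clean(term: str) -> str:
    """Collapse internal whitespace, trim, and lowercase a term for matching."""
    return re.sub(r"\s+", " ", term).strip().lower()

def reconcile(terms, vocab):
    """Return (term, uri_or_None) pairs; only exact label matches are linked."""
    return [(t, vocab.get(clean(t))) for t in terms]

def to_ntriples(rows, base="http://example.org/local/"):
    """Emit one skos:exactMatch triple per reconciled term (N-Triples syntax)."""
    pred = "http://www.w3.org/2004/02/skos/core#exactMatch"
    lines = []
    for term, uri in rows:
        if uri is not None:
            subject = base + clean(term).replace(" ", "-")
            lines.append(f"<{subject}> <{pred}> <{uri}> .")
    return lines

rows = reconcile(["  Walking ", "Love", "Annals"], VOCAB)
triples = to_ntriples(rows)
for line in triples:
    print(line)
```

Terms without a match (here "Annals") keep an empty URI and are simply left out of the export, which mirrors the idea of adding a new field that holds links only where reconciliation succeeded.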

Key Questions. The current case study aimed to apply the FYM method to map the X

Museum's collections data to two vocabularies: the Library of Congress Subject Headings

(LCSH), and Iconclass (www.iconclass.org). While van Hooland et al. (2013) provide

quantitative analysis of their results with the Powerhouse Museum collection, we cannot

necessarily extrapolate these results to other museum collections, precisely because every

museum collection and its data are unique. Indeed, the X Museum and Powerhouse collections

are very different—the X Museum has fine art objects from Europe and America, while the

Powerhouse Museum has a diverse historical collection of everyday objects. So, while van

Hooland et al. (2013) mapped almost 90% of the Powerhouse Museum records to LOD using a

combination of the LCSH and Art and Architecture Thesaurus (AAT) vocabularies, and also

concluded that combining the two vocabularies produced best results, it is not clear that the FYM

method would produce the same results in every situation.

Thus, this case study aims to test the results of van Hooland et al. with a very different

set of data. Additionally, the X Museum is interested in the results of this case study, as the

information will aid in decisions about whether to proceed with LOD reconciliation as an

approach for adjusting their existing local vocabulary.

The case study began with the following key questions:

Question 1: What metadata elements from the X Museum collections are

best used for the reconciliation process?

Question 2: Will the LCSH or Iconclass vocabulary produce better results?

Or would a combination of both vocabularies produce best results?


Question 3: What percentage of records can be mapped to the LOD cloud?

Question 4: How much data manipulation and manual labor is required for

the reconciliation process?

Data Sets and Vocabulary Chosen. In their FYM process, van Hooland et al. (2013)

used the collection dataset of 75,823 objects made available online by the Powerhouse

Museum, and analyzed terms populated in the 'Categories' metadata element. They mapped these

terms to LCSH and AAT vocabularies. The Categories element is populated by the Powerhouse

Object Names Thesaurus (PONT), a vocabulary of 7,595 terms. Van Hooland et al. (2013)

chose the LCSH vocabulary because it is available in RDF format as a linked data source, it is

the largest controlled vocabulary available in English, and it is the most widely adopted subject

index used worldwide (p. 468). They chose AAT because it is "the 'most widely known specialist

thesaurus' (Broughton, 2006, p. 41), developed for the cultural heritage domain with a specific

focus on art, architecture, and material culture" (van Hooland et al., 2013, p. 468).

The X Museum case study has adapted the FYM method to the peculiarities of its set of

data. Instead of using data from the entire collection, this study focused on the terms in the X

Museum's local subject vocabulary. Although a portion of records in the museum’s collection

have not been assigned subject terms at all2, where they do exist, these terms are accurately

2 Large sub-sets of the X Museum collection records do not have any subject terms assigned. The proposed approach for these records is to profile other metadata elements in the records for terms that could be mapped to LCSH and Iconclass. These elements include Title, Description, and Culture; and for the Photographs collection, Inscription. The Title element would be the primary approach, because the title field is a required element, and the titles of works of art often describe their subjects (although "Untitled" is used for a large sub-set of objects with no title). The Description element, while not a required element, tends to include terms describing the object itself, iconography, and scenes depicted, all of which may give clues to the subject. The Culture element provides geographical information that may help with subject assignments. Since photographs often include inscriptions describing the scenes depicted, the inscription field may also include terms that could map to vocabularies for this sub-set. For this case study, X Museum's photographs collection records were given a preliminary examination in order to make future recommendations about the feasibility of this approach for the sub-set of records without subject terms assigned. A full analysis of these data and reconciliation of the records to LOD vocabularies is outside the scope of this case study.


mapped to the X Museum's local vocabulary. By mapping the local vocabulary, the hope is to

effectively create an association for all of the records that have subject terms assigned. Thus,

rather than manipulating hundreds of thousands of item records in Open Refine, the work is

limited to the 2,337 preferred terms in the X Museum's subject vocabulary.

This case study chose to reconcile the X Museum's terms with LCSH for the same

reasons it was chosen for the FYM method, as described above. LCSH also offers a means for

the X Museum to align its collection with a broadly defined vocabulary that is not domain-

specific, offering opportunity for opening the collection to non-subject experts. In addition to

LCSH, this study chose Iconclass as a second vocabulary, instead of AAT. This is because many

terms in the X Museum's local vocabulary describe iconography and compositions of the works

of art in the collection. These types of terms do not exist in the LCSH, but are heavily

represented in Iconclass for European subjects. For example, common compositions in European

art from the Christian religion (e.g. pieta, Mary as Shepherdess) and Greek Mythology (e.g.

putti, Marriage of Telemachus and Circe) are represented in Iconclass. Mapping to AAT is not

necessary in this case, as much of the X Museum's collection is already mapped to AAT

(although not in LOD format), and a separate project to add X Museum collections records to the

Getty's new CONA vocabulary is underway, and will provide LOD connections to all of the

Getty vocabularies.

Unfortunately, the Iconclass vocabulary was not successfully reconciled with the X

Museum vocabulary for this project. Iconclass does not provide a machine-readable protocol

service for its data, which is the most direct way for Open Refine to access LOD. Attempts to

upload the Iconclass data file directly into Open Refine, using the RDF/SKOS metadata format,

returned a fatal error. Although reconciliation to Iconclass was not possible, the process of


matching the X Museum subject terms to the LCSH vocabulary did reveal clear areas where

Iconclass could provide matches to the X Museum vocabulary that were left unmatched to

LCSH.

In summary, this case study adapts the FYM method of vocabulary reconciliation to the

specific situation of the X Museum collection data. Instead of utilizing metadata for individual

objects in the collection, it focused on mapping the X Museum local subject vocabulary to the

LCSH controlled vocabulary. This study also did not reconcile the X Museum data to a second

vocabulary, nor did it attempt to assess whether objects not assigned subject terms at all would

be served by the existing local vocabulary.

First Step—Profiling and Cleansing Metadata

The first step of the FYM method is to profile and clean the data. Very little cleaning was

necessary for the X Museum vocabulary. This is probably due to the fact that it is a controlled,

intentionally created list. There were no duplicate fields, each term field contained only one

term, and there was no need to normalize terms with different spellings and formats, because

each term only exists once (except for the terms repeated in different contexts, which shouldn't

be merged anyway), and no typos were found. This process revealed one of the advantages of

working with a controlled list of terms as opposed to terms assigned individually over time in a

large dataset.

Profiling the data involved using Open Refine to become familiar with the data.

Clustering, sorting, and faceting tools help to analyze how the data is structured, to understand the

depth and breadth of the terms, to see where clusters of concepts appear, and to understand

the distribution of use for various types of terms. One disadvantage of working with the subject

vocabulary instead of the actual collections data is that an analysis of the distribution of terms is


not possible. Through the FYM method, van Hooland et al. (2013) were able to visualize the

skewed distribution of the Powerhouse Museum terms across the entire collection. Such

information can become useful during reconciliation, for example, by informing choices about

merging concepts.

The 3,839 terms in the X Museum subject vocabulary are categorized into four types:

Term (2,204), Synonym (32), Related (101), and Alternate (1,502). After verifying with the

collections manager that “Alternate” terms are simply alternate spellings of Terms, made

available for search optimization, and not used in the collections records, these were removed

from the data, resulting in 2,337 terms. Synonyms and Related terms were kept, as it is unclear

how these may differ from Terms, or how they may be assigned to the collections records.
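The filtering step can be checked with a short calculation. The sketch below takes the per-type counts reported above, drops the Alternate type, and sums what remains; the variable names are illustrative only and do not correspond to fields in the museum's system.

```python
# Counts per term type, as reported in the text above.
type_counts = {"Term": 2204, "Synonym": 32, "Related": 101, "Alternate": 1502}

# Alternate terms are spelling variants used only for search, so they are dropped.
kept = {name: n for name, n in type_counts.items() if name != "Alternate"}
kept_total = sum(kept.values())

print(kept)
print(kept_total)
```

The remaining total is the 2,337 terms carried forward into reconciliation.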

The X Museum vocabulary is organized in a hierarchy of five levels. Faceting the terms

in Open Refine by level reveals that Level 3 contains the bulk of all terms:

Table 1 Number of terms in X Museum subject vocabulary by level in hierarchy

Level # of Terms

Level 1 7

Level 2 56

Level 3 1,463

Level 4 588

Level 5 223

Terms become more specific in each level of the hierarchy, with Level 1 being the most general,

and Level 5 the most specific. The seven Level 1 terms do not describe single concepts or

subjects, but describe general topics, such as the natural world, or buildings and architecture.

Level 4 and 5 terms on the other hand, are extremely specific. All but one term in Level 5 are

names. Names also comprise 38% of Level 4 terms, which also includes specific terms like


varieties of flowers, types of birds, and names of mythical beasts in Greek mythology. While

Level 1 and Level 5 were unlikely to produce matches with the LCSH vocabulary, these terms

were kept in the data set; they can easily be removed from totals at the end, and any terms that do

match could prove useful. Based on the distribution in this hierarchy, one might expect that

terms in Level 3 would produce the most matches with LCSH. However, many names also

appear in Level 3, and some general terms that may be useful for searching exist in Level 4.
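Faceting by level amounts to counting rows per value of one column. The sketch below shows a generic facet function on a few hypothetical rows, then computes each level's share of the vocabulary from the counts in Table 1. The sample terms and the field names ("term", "level") are assumptions made for illustration.

```python
from collections import Counter

def facet(rows, key):
    """Count how many rows carry each value of the given field."""
    return Counter(row[key] for row in rows)

# Hypothetical sample rows; the real vocabulary has 2,337 of these.
sample = [
    {"term": "Walking", "level": 3},
    {"term": "Love", "level": 3},
    {"term": "Annals", "level": 4},
]
sample_facet = facet(sample, "level")

# Counts per hierarchy level, taken from Table 1.
level_counts = {1: 7, 2: 56, 3: 1463, 4: 588, 5: 223}
total = sum(level_counts.values())
shares = {level: round(100 * n / total, 1) for level, n in level_counts.items()}

print(sample_facet)
print(shares)
```

On the Table 1 counts, Level 3 accounts for a little under two thirds of all terms, which is why it is the natural place to expect most matches.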

It was desirable to remove the proper names from the data set because names are not

subjects—the LCSH is a subject vocabulary, not a name authority. However, because there is no

way for Open Refine to know which terms are proper names so they can be filtered out, these

were included in the reconciliation. During the reconciliation process proper names were flagged

manually so that they could be easily removed from totals. As we will see below, quite a few of

these proper nouns did have corresponding subject terms in LCSH.

Second Step—Reconciliation

While Open Refine can create automatic matches to LOD sources, in reality the

reconciliation process is comprised of several steps, most of which require manual work. After

loading LCSH data into Open Refine as a reconciliation service via its SPARQL endpoint3, the

program can be tasked to reconcile data in a single data column. The automated reconciliation

produces one of three possibilities for each term, as seen in the screen capture below, in the

"Term_LCSH" column: a match (term becomes a blue link, e.g. Walking), a list of suggested

matches (light blue links appear under the term, e.g. Love), or no match (term is in black with

option to create new topic below, e.g. Annals).

3 The Wikipedia (2013) definition is "SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework format" (1st sentence).


Figure 1 Screen capture from Google Refine, showing three types of matches in the "Term_LCSH" column after automated reconciliation.
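The three outcomes of automated reconciliation can be expressed as a small decision rule. The sketch below assumes each term comes back with a list of candidate records carrying a "match" flag, loosely modeled on the shape of reconciliation service responses; the candidate labels and scores are hypothetical and were not taken from the actual LCSH results.

```python
def classify(candidates):
    """Map a term's candidate list to one of the three outcomes described above."""
    if any(c.get("match") for c in candidates):
        return "matched"      # a confident match was linked automatically
    if candidates:
        return "suggested"    # candidates exist but need a manual choice
    return "unmatched"        # nothing returned; manual lookup required

# Hypothetical responses for the three example terms in the text.
results = {
    "Walking": [{"name": "Walking", "score": 100.0, "match": True}],
    "Love": [
        {"name": "Love in art", "score": 60.0, "match": False},
        {"name": "Love poetry", "score": 55.0, "match": False},
    ],
    "Annals": [],
}
outcome = {term: classify(cands) for term, cands in results.items()}
print(outcome)
```

Only the first outcome requires no manual work, which is why the share of automatic matches is a useful measure of the labor the method still demands.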

The initial, automated reconciliation matched 10% of the X Museum subject terms to

LCSH. In their initial reconciliation of the Powerhouse Museum collection with LCSH, van

Hooland et al. (2013) matched 12.2% of single terms (p. 471). After examining the way that the

LCSH terms were assigned by Open Refine, van Hooland et al. (2013) were able to

"preprocess" the LCSH linked data in order to greatly increase the number of matches on the

Powerhouse Museum terms to 47.2% (p. 472). On the Free Your Metadata website, the authors

provide access to their "preprocessed" LCSH data and suggest that this will also increase

matching rate for other institutions. Indeed, when this study applied the "preprocessed" LCSH

data to the X Museum vocabulary terms, the percentage of matched terms doubled to 20%.

Unfortunately, upon closer examination, it appears that many of the LCSH terms were

mismatched to the X Museum terms. A spot check revealed many terms assigned were either too

general, or on the wrong topic entirely. This issue may be explained by the fact that the

"preprocessing" of LCSH terms was designed by van Hooland et al. (2013) to address specific


issues in the Powerhouse Museum vocabulary, including the use of term qualifiers and complex,

multi-term concepts, as well as an inconsistent use of singular and plural terms. Since the X

Museum vocabulary is structured differently from the Powerhouse Museum vocabulary, the

preprocessing may have over-simplified the LCSH vocabulary for the current case study.
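To illustrate how this kind of simplification can raise the match rate while also producing wrong matches, consider the sketch below. It is not the preprocessing published by the FYM authors; it is a deliberately naive stand-in that strips a trailing parenthetical qualifier and reduces plurals to singulars. The example headings are used only to show the effect.

```python
import re

def strip_qualifier(label: str) -> str:
    """Remove a trailing parenthetical qualifier, e.g. 'Bridges (Dentistry)'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", label)

def naive_singular(word: str) -> str:
    """Very rough plural stripping; real vocabularies need far more care."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(label: str) -> str:
    """Lowercase, drop the qualifier, and singularize each word."""
    base = strip_qualifier(label).strip().lower()
    return " ".join(naive_singular(w) for w in base.split())

print(normalize("Bridges"))
print(normalize("Bridges (Dentistry)"))
print(normalize("Lilies"))
```

After normalization, a heading about a civil structure and a heading about dental work collapse to the same key. A local term can then be linked to either one, which is consistent with the mismatches observed in the spot check.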

After reconciliation, the matched terms must then be checked manually for mismatches.

Open Refine's suggested matches help speed up this matching process. Many of these

suggestions simply require ticking off the box next to the correct term to add the link. But many

of these suggestions are not appropriate, and these terms, and the unmatched terms, require

manual lookup in the LCSH vocabulary to locate the appropriate term. Open Refine provides a

convenient search interface to the LCSH vocabulary (see Figure 2), but some terms simply need

to be researched in the LCSH vocabulary provided by the Library of Congress's linked data

service (http://id.loc.gov/).

Figure 2 A tool within Open Refine offers a search interface to the linked vocabulary.


Manual Matching

Based on the organization of the X Museum vocabulary into hierarchical levels, and the

desire to remove the personal names from the final count, the method for manual reconciliation

of the remaining terms proceeded as follows:

1. Work with one level at a time, starting with Level 5

2. Match proper names to suggested subject terms (see Appendix A for rules guiding this

process)

3. Flag proper names so they can be removed from totals

4. Manually match remaining terms quickly (using a maximum of three searches in LCSH

website) and note how much time this took

5. Take note of sub-sets of X terms that present problems matching to LCSH

6. Create totals for each level (Table 2)

Table 2 X Museum subject terms matched to LCSH by term hierarchy level, with adjustments after removing number of proper names

                                              Level 1 Level 2 Level 3 Level 4 Level 5  Totals
# Terms in Museum vocabulary                        7      56   1,463     588     223   2,337
# Terms matched automatically                       1       1     159      70       0     231
% Terms matched automatically                     14%      2%     11%     12%      0%     10%
# Terms matched after manual reconciliation         1      47     779     237      38   1,102
% Terms matched after manual reconciliation       14%     84%     53%     40%     21%     47%
Proper Names
# Proper names                                      0       0     557     222     222   1,001
# Proper names matched                              0       0      88      15      36     139
% Terms matched after removing proper names       14%     84%     76%     61%    100%     72%
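The adjusted percentages in the last row of Table 2 can be reproduced from the counts above it. The following is a minimal sketch in Python. It assumes that the adjusted rate is (terms matched minus proper names matched) divided by (terms minus proper names); this formula is inferred from the table and is not stated in the study. Function and variable names are illustrative. Level 5 is left out because only one term remains at that level once proper names are removed.

```python
def match_rate(matched, total):
    """Return the share of matched terms as a whole-number percentage."""
    return round(100 * matched / total)

def adjusted_match_rate(total, matched, names, names_matched):
    """Match rate after removing proper names from both counts.

    Assumed formula (inferred from Table 2, not stated in the study):
    (matched - names_matched) / (total - names).
    """
    return match_rate(matched - names_matched, total - names)

# Counts from Table 2: (terms, matched after manual reconciliation,
# proper names, proper names matched)
table2 = {
    "Level 3": (1463, 779, 557, 88),
    "Level 4": (588, 237, 222, 15),
    "Totals": (2337, 1102, 1001, 139),
}

for label, (total, matched, names, names_matched) in table2.items():
    raw = match_rate(matched, total)
    adjusted = adjusted_match_rate(total, matched, names, names_matched)
    print(f"{label}: {raw}% matched, {adjusted}% after removing proper names")
```

For the totals this gives 963 of 1,336 terms, or 72%, which agrees with the figure reported in the table.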


Analysis

As seen from the results presented in Table 2, removing the proper names from the X

Museum vocabulary significantly increases the matches to the LCSH vocabulary. And, as

suspected, Levels 1 and 5 contain terms that are not compatible with the subject headings in

LCSH. Focusing on Levels 2-4, we see that Level 3 is not necessarily the level with the highest

percentage of terms matched. Although there are very few terms in Level 2, and few of these

matched after the automatic reconciliation, manual reconciliation raised the number of matches

significantly. Comparing these numbers with the FYM method, we see that 50% of the

Powerhouse Museum terms were matched to LCSH after automatic reconciliation, using the

full Powerhouse Museum collection data with the "preprocessed" LCSH terms. In this case

study, a combination of automatic and manual reconciliation of the X Museum vocabulary

resulted in 72% of terms matched to LCSH.

While manually matching the terms, several issues with the X Museum vocabulary

became obvious. First, not only does the X vocabulary contain many proper names, it also

includes many literary titles, place names, and mythological names. All of these terms were

matched to LCSH if a subject heading was easily located—no more than three searches were

made on the LCSH website. However, many of these names did not have a corresponding term

in LCSH. For example, the X Museum vocabulary contains the names of many churches, as well

as the names of many minor gods and creatures in Greek mythology. Mapping these names to

broader terms in the LCSH (e.g. City Churches; Rural Churches; Gods, Greek) would increase

the matched terms, and also may provide better findability for non-subject-experts. Similarly, the


many specific names of types of flowers, insects, and birds in the X Museum vocabulary could be

mapped to broader terms in LCSH to improve findability.

Secondly, Greek names created problems for automatic matching, requiring added time

to perform manual matching. Many Greek mythology names in the X Museum vocabulary are

also used as biological names of plants and animals. This caused many mismatches with LCSH

that had to be rectified. For example, Hydra matched automatically to the LCSH term for the sea

creature. The Greek names also have many alternate spellings, requiring troubleshooting when

searching for the term in LCSH. Because this case study made no more than three attempts to

match any one term to LCSH, many more of these names may be matchable in LCSH with

more research.
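Alternate spellings of this kind are one place where approximate string matching might reduce manual lookups. The sketch below uses Python's standard difflib module to suggest candidate headings for a variant spelling. This technique was not part of the case study; the candidate labels and the cutoff value are assumptions chosen for illustration, and the labels are not taken from LCSH.

```python
from difflib import get_close_matches

def suggest_headings(term, headings, cutoff=0.8):
    """Return up to three headings whose spelling is close to `term`."""
    # Compare in lower case, but return the headings as originally written.
    lowered = {h.lower(): h for h in headings}
    close = get_close_matches(term.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[c] for c in close]

# Hypothetical candidate labels for illustration only.
candidates = ["Heracles", "Hermes", "Hydra", "Hera"]
print(suggest_headings("Herakles", candidates))  # ['Heracles']
```

A suggestion of this kind would still need to be checked by a person, and it does nothing to resolve homonyms such as Hydra, where the spelling matches but the concept does not.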

Thirdly, as suspected, the X Museum vocabulary does include many terms that describe

common iconography and scenes in works of art (e.g. Adoration of the Shepherds). With the

exception of some terms describing Biblical scenes (probably because they also exist in

literature), most of these terms were not available in LCSH. This is where mapping to the

Iconclass vocabulary would likely prove very useful, and fill in many of the gaps left by LCSH.

Conclusion and Recommendations

This case study began with four key questions, which were answered as follows.

Question 1: What metadata elements from the X Museum collections are best used for

the reconciliation process? Because the X Museum collection subject vocabulary is closely

controlled and mapped consistently to the collection records, the reconciliation was performed

on the Term element field within the vocabulary itself. Because so many proper names are

included in this vocabulary, however, the overall matching rate is lowered. Even so, many of the

proper names in the X Museum vocabulary do match LCSH historical subjects, and thus may

prove useful for search. The recommendation is to keep these proper names and their links to

the LCSH LOD vocabulary in the data set. With further research, the remaining unmatched

proper names could be used to identify appropriate subject terms in LCSH (e.g. Alan Pinkerton

could be mapped to the LCSH term Spies).

Question 2: Will the LCSH or Iconclass vocabulary produce better results? Or would a

combination of both vocabularies produce best results? Because the case study was not able to

achieve reconciliation with the Iconclass vocabulary, this question cannot be answered.

However, it is clear that the Iconclass vocabulary could be used to supplement the LCSH LOD

for those terms in the X Museum vocabulary that describe iconography and common

compositions in works of art. If the technical issues with the Iconclass RDF files can be resolved,

it is recommended to reconcile the X Museum vocabulary with Iconclass.

Question 3: What percentage of records can be mapped to the LOD cloud? Taking

proper names out of the mix, 72% of the X Museum collection vocabulary terms were matched

to LCSH. Determining how many collection records this mapping would affect would require

calculating how many records are assigned terms not matched to LCSH at all. The current study

does not have access to that information. Since the unmatched terms tend to be highly specific, it

is possible that those terms are not the only term assigned to any single object record. Therefore,

it's likely that the 72% of terms that matched to LCSH are used to describe a majority of objects

in the X Museum collection (not including those records with no subject terms assigned).
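The record-level calculation described above could be carried out along the following lines once term assignments are available. This is a hypothetical sketch in Python: the record identifiers, term assignments, and set of matched terms are invented sample data, not X Museum data, and the function name is illustrative.

```python
def record_coverage(record_terms, matched_terms):
    """Share of records with subject terms that have at least one
    term matched to LCSH. Records with no terms are excluded."""
    with_terms = {rid: terms for rid, terms in record_terms.items() if terms}
    if not with_terms:
        return 0.0
    covered = sum(
        1 for terms in with_terms.values()
        if any(term in matched_terms for term in terms)
    )
    return covered / len(with_terms)

# Invented sample data for illustration only.
records = {
    "obj-1": ["Adoration of the Shepherds", "Churches"],
    "obj-2": ["Hydra"],
    "obj-3": ["Saint Agnes"],
    "obj-4": [],  # no subject terms assigned, so excluded from the count
}
matched = {"Churches", "Saint Agnes"}

print(f"{record_coverage(records, matched):.0%}")  # 67%
```

In this sample, the first record counts as covered even though one of its two terms is unmatched, which mirrors the reasoning that highly specific unmatched terms are often not the only term on a record.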

Question 4: How much data manipulation and manual labor is required for the

reconciliation process? Most of the reconciliation of the X Museum collection subject

vocabulary to LCSH was completed manually. The automatic process was very quick: after loading

the vocabulary into Open Refine, the reconciliation took about 30 minutes to complete on a

standard residential cable Internet connection. However, it matched only 10% of the terms. If the

proper names are removed from the data set, that number increases to 17% of terms matched. This

is consistent with the findings of van Hooland et al. (2013), who matched 12% of the Powerhouse

Museum collection records with automatic reconciliation using the LCSH LOD as it is provided

by the Library of Congress (p. 471). The "preprocessing" of the LCSH vocabulary data that these

authors performed in order to increase automatic matching seems to have been tailored so

specifically to the Powerhouse Museum collection that it was not useful in this case study. The

authors do not indicate how much time the "preprocessing" took, but it can be seen essentially as

a manual component of the reconciliation.

After profiling the data and performing the automatic reconciliation, the manual process

in this case study took about 10 hours for one person to complete. The Open Refine tool certainly

aided the manual reconciliation process by providing a search tool within its interface; creating

automatic links between the data in the system and the LCSH LOD; and providing tools for

sorting, creating facets, flagging, and editing records in bulk. Manual reconciliation of the X

Museum subject vocabulary involved comparing suggested terms to the X Museum collection

context and making decisions about whether a match was topical, and looking up terms in the

LCSH to find possible matches. This case study limited such searches to no more than three;

further research could suggest additional matches that would increase the percentage of

vocabulary terms matched to the LCSH vocabulary.


Appendix A--Rules for matching proper names in the X Museum vocabulary to LCSH subject headings

1. MATCHED if the name exists in LCSH as a subject heading
   a) IF the name is used as a general marker for a historical period and
   b) IF there is 100% certainty about the intended person matching the LCSH term

   e.g. Louis XIV

   LCSH options:
      Paris (France)--History--Louis XIV, 1643-1715
      Louis XIV, King of France, 1638-1715--Portraits
      Decoration and ornament--Louis XIV style
      France--History--Louis XIV, 1643-1715

   Choose France--History--Louis XIV, 1643-1715

2. NOT matched if < 100% certainty about the person matching the term, or disambiguation among several options requires research.

   e.g. Frederick I

   LCSH options:
      Norway--History--Frederick I, 1523-1533
      Denmark--History--Frederick I, 1523-1533
      Germany--History--Frederick I, 1152-1190

3. NOT matched if the name is used in LCSH only as a qualifier for a more specific concept

   e.g. King David

   LCSH options (David, King of Israel not available):
      David, King of Israel -- in art
      David, King of Israel -- in literature
      David, King of Israel -- in rabbinical literature

4. MATCHED names to a broader term if it is narrower than the next level up in the museum’s vocabulary hierarchy

   e.g. All saints' names matched to Christian Saints
      Saint Agnes matched to Christian Saints
      Saint Andrew matched to Christian Saints

5. MATCHED mythological names, literary characters, legendary characters if available in LCSH.


References

Allinson, J. (2012). OpenART: Open Metadata for Art Research at the Tate. Bulletin of the

American Society for Information Science and Technology, 38(3).

Baca, M. (2007). CCO and CDWA Lite: Complementary Data Content and Data Format

Standards for Art and Material Culture Information. Visual Resources Association

Bulletin, 34(1), 69-75.

Barker, G. (2012, September 1). More on museum datasets, un-comprehensive-ness, data

mining. (Web log comment). Retrieved from

http://www.freshandnew.org/2012/08/museum-datasets-un-comprehensive-ness-data-mining/

Berners-Lee, T. (2006, July 27). Linked Data - Design Issues. Retrieved from

http://www.w3.org/DesignIssues/LinkedData

Berners-Lee, T. (2009). Tim Berners-Lee on the Next Web. (Video file). Retrieved from

http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html

Chan, S. (2012, August 23). More on museum datasets, un-comprehensive-ness, data mining.

(Web log). Retrieved from

http://www.freshandnew.org/2012/08/museum-datasets-un-comprehensive-ness-data-mining/

Coburn, E., Lanzi, E., O'Keefe, E., Stein, R., & Whiteside, A. (2010). The Cataloging Cultural

Objects experience: Codifying practice for the cultural heritage community. IFLA

Journal, 36(1), 16-29. doi:10.1177/0340035209359561

Elings, M. W., & Waibel, G. (2007). Metadata for All: Descriptive Standards and Metadata

Sharing across Cultural Heritage Communities. Visual Resources Association Bulletin,

34(1), 7-14.


Free Your Metadata. (2013). freeyourmetadata.org

Goodlander, G. (2013, March 21). OpenGLAM: LOD and American Art. (Slide presentation).

http://www.slideshare.net/georginab/openglam-lod-and-american-art

Henry, D. & Brown, E. (2012). Using an RDF Data Pipeline to Implement Cross-Collection

Search. In D. Bearman & J. Trant (eds.) Museums and the Web 2012: Proceedings. San

Diego: Archives & Museum Informatics, 2012. Retrieved from

http://www.museumsandtheweb.com/mw2012/papers/using_an_rdf_data_pipeline_to_implement_cross_

Isaac, A. (2012, March 26). Europeana and Linked Open Data. (Web log). Retrieved from

http://blog.okfn.org/2012/03/26/europeana-and-linked-open-data/

Isaac, A., Clayphan, R., & Haslhofer, B. (2012). EUROPEANA: Moving to Linked Open Data.

Information Standards Quarterly, 24(2/3), 34-40. doi:10.3789/isqv24n2-3.2012.06

Improving Digital Record Annotation Capabilities with Open-sourced Ontologies and Crowd-

sourced Workers. (2012, June 5). (Press release). Retrieved from LexisNexis.

Kessler, B. (2007). Encoding Works and Images: The Story Behind VRA Core 4.0. Visual

Resources Association Bulletin, 34(1), 20-33.

Koolen, M., Kamps, J., & De Keijzer, V. (2009). Information Retrieval in Cultural Heritage.

Interdisciplinary Science Reviews, 34(2/3), 268-284. doi:10.1179/174327909X441153.

Library of Congress. (n.d.). LC Linked Data Service: Authorities and Vocabularies.

http://id.loc.gov/

Linked Open Data in Libraries, Archives and Museums. (2013). lod-lam.net

Museums and the Machine-Processable Web wiki. (2013). http://museum-api.pbworks.com/



Museum APIs (2013). Retrieved from


OpenGLAM. (2013). http://openglam.org/

OpenRefine (n.d.). http://openrefine.org/

Purday, J. (2009). Think culture: Europeana.eu from concept to construction. The Electronic

Library, 27(6), 919-937.

RDF Refine: A Google Refine extension for exporting RDF. (n.d.). http://refine.deri.ie/

STITCH. Semantic Interoperability To access Cultural Heritage. http://www.cs.vu.nl/STITCH/

USC tech experts to guide Smithsonian museum to next generation of the Internet (2012,

December 8). (Press release). Retrieved from LexisNexis.

van Hooland, S., De Wilde, M., Verborgh, R., Mannens, E., Van de Walle, R., & Hercher, J.

(2013). Evaluating the success of vocabulary reconciliation for cultural heritage

collections. Journal of the American Society for Information Science and Technology, 64(3), 464-479.

doi:10.1002/asi.22763

Waibel, G. (2010). Museum Data Exchange: Learning How to Share. D-Lib Magazine,

16(3/4). doi:10.1045/march2010-waibel

Wikipedia. (2013). SPARQL. http://en.wikipedia.org/wiki/SPARQL
