enrichment and structuring of archival description metadata

ACL/LaTeCH-Portland, June 24th 2011

Enrichment and Structuring of Archival Description Metadata

Kalliopi Zervanou*, Ioannis Korkontzelos**, Antal van den Bosch* & Sophia Ananiadou**

* Tilburg Centre for Cognition & Communication, The University of Tilburg, NL
K.Zervanou@uvt.nl, Antal.vdnBosch@uvt.nl

** National Centre for Text Mining, The University of Manchester, UK
Ioannis.Korkontzelos@manchester.ac.uk, Sophia.Ananiadou@manchester.ac.uk

Research on Metadata

• Developing standards:
  – collection specific (e.g. EAD, MARC21)
  – cross-collection (e.g. Dublin Core)

• Provide mappings:
  – across schemas
  – ontologies (ad hoc or standard, e.g. CIDOC-CRM)

• Discard metadata for IR (Koolen et al., 2007)

• Exploit metadata for IR (Zhang & Kamps, 2009)

The IISH EAD dataset

• EAD: XML standard for encoding archival descriptions

• Challenges:
  – Variety of languages used
  – Varying type and amount of information
  – Style: enumerations, lists, incomplete sentences
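Because EAD is plain XML, the free-text parts of a description can be read with any standard XML parser. The sketch below uses Python's built-in ElementTree on a toy fragment. The sample record and the choice of elements (unittitle, bioghist, scopecontent) are assumptions for illustration; the slides do not list which EAD elements were actually selected.

```python
import xml.etree.ElementTree as ET

# Toy EAD-like fragment (hypothetical); real finding aids are far larger
# and may declare XML namespaces, which this sketch does not handle.
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Papers of a trade union</unittitle>
    </did>
    <bioghist><p>Founded in 1905; dissolved in 1940.</p></bioghist>
    <scopecontent><p>Minutes, correspondence, pamphlets.</p></scopecontent>
  </archdesc>
</ead>"""

# Assumed selection of free-text elements, not the authors' actual list.
FREE_TEXT_TAGS = ("unittitle", "bioghist", "scopecontent")


def extract_free_text(xml_string, tags=FREE_TEXT_TAGS):
    """Return (tag, text) pairs for each selected element, whitespace-normalised."""
    root = ET.fromstring(xml_string)
    results = []
    for element in root.iter():
        if element.tag in tags:
            # itertext() also collects text from nested elements such as <p>.
            text = " ".join("".join(element.itertext()).split())
            if text:
                results.append((element.tag, text))
    return results


snippets = extract_free_text(SAMPLE_EAD)
for tag, text in snippets:
    print(f"{tag}: {text}")
```

The output also shows the style challenge listed above: the extracted text is mostly enumerations and sentence fragments rather than running prose.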

Motivation & Objectives

• Improved search and retrieval
  – content-based metadata document clustering
  – content-based/semantic search
  – support exploratory search
  – link across collections, metadata formats & institutions
  – create unified metadata knowledge resources

Method overview

Pre-processing

• EAD/XML element selection & extraction
  – EAD elements containing free-text & archive content information

• Language identification (n-gram method)
  – Identifier trained on Europarl corpus

• Text snippets length: ~20 tokens

Figure: Snippet length based on language
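One common n-gram approach to language identification compares ranked character n-gram profiles (the "out-of-place" measure). The slides do not say which variant was used, so the sketch below is an assumption. It also trains on three tiny made-up sentences rather than on Europarl, purely to stay self-contained.

```python
from collections import Counter

# Tiny stand-in training texts (hypothetical); the slides report training
# the identifier on the Europarl corpus instead.
TRAINING = {
    "en": "the archive contains letters and minutes of the meetings of the union and the party",
    "nl": "het archief bevat brieven en notulen van de vergaderingen van de bond en de partij",
    "de": "das archiv umfasst briefe und protokolle der sitzungen der gewerkschaft und der partei",
}


def char_ngrams(text, n_max=3):
    """Yield character n-grams of length 1..n_max, padding each word with '_'."""
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(text, size=300):
    """Map the `size` most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(snippet_profile, language_profile, size=300):
    """Sum of rank differences; n-grams unseen in the language cost `size`."""
    return sum(
        abs(rank - language_profile[gram]) if gram in language_profile else size
        for gram, rank in snippet_profile.items()
    )


PROFILES = {lang: build_profile(text) for lang, text in TRAINING.items()}


def identify(snippet):
    """Return the language whose profile is closest to the snippet's profile."""
    snippet_profile = build_profile(snippet)
    return min(PROFILES, key=lambda lang: out_of_place(snippet_profile, PROFILES[lang]))


print(identify("notulen van de vergaderingen"))
print(identify("letters and minutes of the union"))
```

Short snippets give sparse profiles, which is one reason a snippet length of around 20 tokens is a sensible unit for this kind of identifier.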

Method overview

Enrichment & Structuring

• Topic detection: automatic term recognition using the C-value method

• Agglomerative hierarchical term clustering:
  – complete, single & average linkage criteria
  – document co-occurrence & lexical similarity measures
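As a rough illustration of the C-value score, the sketch below applies the published formula to pre-counted candidates: a candidate's frequency is discounted by the mean frequency of the longer candidates that contain it, then weighted by log2 of its length in words. The linguistic filter that produces the candidates is omitted, and the counts are hypothetical.

```python
import math


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_value(frequencies):
    """Score multi-word candidate terms with the C-value formula.

    `frequencies` maps a candidate (tuple of words) to its total frequency,
    including occurrences nested inside longer candidates.
    """
    scores = {}
    for term, freq in frequencies.items():
        longer_terms = [other for other in frequencies if contains(other, term)]
        if longer_terms:
            # Nested candidate: subtract the mean frequency of its containers.
            nested_mean = sum(frequencies[t] for t in longer_terms) / len(longer_terms)
            freq = freq - nested_mean
        scores[term] = math.log2(len(term)) * freq
    return scores


# Hypothetical counts, for illustration only.
counts = {
    ("trade", "union"): 10,
    ("trade", "union", "congress"): 4,
    ("international", "trade", "union", "congress"): 2,
}
scores = c_value(counts)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:5.2f}  {' '.join(term)}")
```

Note that single-word candidates would score zero under this formula, since log2(1) = 0; the method targets multi-word terms.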

Method overview

Term results (automatic evaluation)

Results

• C-value best performance: candidates that occur as non-nested at least once

• Average linkage criterion & document co-occurrence: provide broader and richer hierarchies
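The combination of average linkage and document co-occurrence can be sketched in a few lines of plain Python. The term-to-document sets below are hypothetical, and Jaccard overlap is an assumed stand-in for the document co-occurrence measure, whose exact definition the slides do not give.

```python
from itertools import combinations

# Hypothetical term -> set of ids of the documents in which the term occurs.
TERM_DOCS = {
    "trade union": {1, 2, 3},
    "labour movement": {2, 3, 4},
    "strike": {1, 2, 3, 4},
    "anarchism": {7, 8},
    "spanish civil war": {7, 8, 9},
}


def jaccard(a, b):
    """Document co-occurrence similarity of two terms (an assumed measure)."""
    return len(TERM_DOCS[a] & TERM_DOCS[b]) / len(TERM_DOCS[a] | TERM_DOCS[b])


def average_linkage(cluster_a, cluster_b):
    """Mean pairwise similarity between the members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def agglomerate(terms, linkage=average_linkage):
    """Repeatedly merge the two most similar clusters; return the merge history."""
    clusters = [(term,) for term in sorted(terms)]
    merges = []
    while len(clusters) > 1:
        left, right = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
        merges.append((left, right, linkage(left, right)))
    return merges


merges = agglomerate(TERM_DOCS)
for left, right, similarity in merges:
    print(f"{similarity:.2f}  {left} + {right}")
```

Single or complete linkage can be tried by passing a different `linkage` function that takes the maximum or minimum pairwise similarity instead of the mean.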

Questions? Check out our poster!
