GETTING METADATA TO WORK HARDER: re-use, standardisation and streamlining, a data archive perspective

LUCY BELL
MANAGEMENT INFORMATION MANAGER
UK DATA ARCHIVE
UNIVERSITY OF ESSEX

THE VALUE OF CATALOGUING, CIG 2012, UNIVERSITY OF SHEFFIELD
10 – 11 SEPTEMBER 2012

Introduction

• recent changes to 45 years’ worth of cataloguing and indexing practices

• changes are large, wide-ranging – and still underway!

• we hope they will both enhance the user’s experience and create organisational efficiencies

Themes

• the UK Data Archive: what it is
• current practice: metadata schema and tools used at the Archive
• recent internal initiatives
• generally: the problems we encountered; the solutions we have employed
• next steps

The UK Data Archive

• based at the University of Essex since 1967
• curator of the largest collection of digital data in the social sciences and humanities in the UK
• holds several thousand datasets relating to society, both historical and contemporary, making these available via its services:
  • UK Data Service from October 2012
  • previously, the Economic and Social Data Service (ESDS)
• it is a place of national deposit for The National Archives
• www.data-archive.ac.uk / (www.esds.ac.uk)

The UK Data Archive: current cataloguing standards

• the Archive provides access to over 5000 digital data collections
• all of these items are catalogued at study level, and many at variable level
• using the de facto standard data cataloguing schema, DDI (Data Documentation Initiative, see http://www.ddialliance.org/)
• currently, the Archive uses:
  • DDI 2.1 (now known as DDI-C, for codebook)
  • the Humanities and Social Science Electronic Thesaurus (HASSET), © University of Essex, based on UNESCO
  • internally-controlled authority lists and CVs

HASSET

• multidisciplinary thesaurus developed to support the UK Data Archive collection

• coverage in the core subject areas of social science disciplines

• uses standard hierarchical relationships: TT (top term); BT (broader term); NT (narrower term); RT (related term) etc.

• role of HASSET in the Archive is twofold:
  • used internally for indexing studies and series with HASSET terms
  • also a separate product licensed to others
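
The BT/NT/RT relationships listed above can be sketched as a small data structure. The terms below are invented for illustration and are not taken from HASSET; the `Term` class and the `top_term` helper are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry with its hierarchical and associative links."""
    label: str
    broader: list[str] = field(default_factory=list)   # BT
    narrower: list[str] = field(default_factory=list)  # NT
    related: list[str] = field(default_factory=list)   # RT


# Hypothetical terms, not real HASSET content.
THESAURUS = {
    "EDUCATION": Term("EDUCATION", narrower=["SCHOOLS"]),
    "SCHOOLS": Term("SCHOOLS", broader=["EDUCATION"],
                    narrower=["PRIMARY SCHOOLS"], related=["TEACHERS"]),
    "PRIMARY SCHOOLS": Term("PRIMARY SCHOOLS", broader=["SCHOOLS"]),
    "TEACHERS": Term("TEACHERS", related=["SCHOOLS"]),
}


def top_term(label: str) -> str:
    """Follow BT links upwards until a term with no broader term (the TT)."""
    seen = set()
    while THESAURUS[label].broader:
        if label in seen:  # guard against a cyclic BT chain
            raise ValueError(f"cycle in broader terms at {label!r}")
        seen.add(label)
        label = THESAURUS[label].broader[0]
    return label


print(top_term("PRIMARY SCHOOLS"))
```

A term with no broader term is its own top term, so `top_term("TEACHERS")` returns `"TEACHERS"` in this toy data.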

Significant recent metadata/indexing developments

1. May – October 2010: a review was carried out of the UK Data Archive’s resource discovery tools.
  • 2011: a project was started to apply the review’s results to the Archive’s resource discovery applications.

2. 2011 onwards: work was started to move from the DDI-C to DDI-L (for lifecycle) metadata schema.

3. June 2012 – January 2013: SKOS-HASSET, a JISC-funded project, is being undertaken to apply SKOS to HASSET and to test its automated indexing capacity.

Shared requirements…

• it became clear that most of these initiatives were pointing at one thing:

The need for more controlled - and harder-working - metadata

1. Resource discovery review

• How do researchers find data?

• trends in information-seeking behaviour show that users prefer simple, Google-like interfaces…

• …but which still return acutely-focused and highly-relevant results.

• the look and feel of the interfaces should be simple but the results must achieve academic rigour.

Result of the review: the metadata conundrum

• for data services to produce simple interfaces - which still return highly-relevant results - metadata are required which are both:
  • extremely powerful
  • increasingly invisible

• a conceptual shift has taken place: the work to focus searches has moved behind the interface

The previous Archive search context

[Diagram: the 21 separate search and browse interfaces previously offered, including the ESDS Data Catalogue; the ESDS Government, International, Longitudinal and Qualidata search interfaces; Variable Search; the ESDS Government Survey Finder; Nesstar; Quali Online; the Survey Question Bank; the Census data catalogue; the CESSDA catalogue; UKDA-Store; RELU-DSS; HDS; SDS; publications lists citing ESDS Government, International and Longitudinal data; comparable geography and comparable indicators (Longitudinal); and browse routes such as Major Studies, Subject Headings, New releases and Thematic pages.]

• HASSET and other CVs may be used in the majority of search and browse activities.
• 21 interfaces

The vision: use CVs to enhance the user’s experience

• We wanted:
  • a single search interface
  • the ability to move seamlessly from one type of resource to another:
    • via faceted browsing and
    • directly from within each resource type

• This required:
  • cross-referencing data collections with publications, with research outputs, with support guides, with case studies using metadata

• Many controlled vocabularies!

The result: single faceted search/browse interface

• We are moving from this:

• To this:

Facets needing controlled vocabularies

• Some were already in a fit state:
  • Depositor (existing authority list)
  • Country (existing authority list)
• Others needed mapping to high levels:
  • Subject categories (116 categories mapped to 21 top terms)
• Many were populated with freetext:
  • Observation unit
  • Spatial unit
  • Kind of data
  • Time method
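
To illustrate why each facet needs controlled values, the sketch below counts records per facet value, as a faceted browse panel might. The records and field names are made up for the example and do not reflect the Archive's catalogue.

```python
from collections import Counter

# Hypothetical catalogue records; the field names are illustrative only.
records = [
    {"title": "Study A", "kind_of_data": "Numeric", "country": "England"},
    {"title": "Study B", "kind_of_data": "Numeric", "country": "Scotland"},
    {"title": "Study C", "kind_of_data": "Textual", "country": "England"},
]


def facet_counts(items, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(item[facet] for item in items if item.get(facet))


def filter_by(items, facet, value):
    """Narrow the result set to records whose facet equals the chosen value."""
    return [item for item in items if item.get(facet) == value]


print(facet_counts(records, "kind_of_data"))
print([r["title"] for r in filter_by(records, "country", "England")])
```

Because the counts are keyed on exact strings, freetext variants such as "Numeric" and "numeric data" would each appear as a separate facet entry, which is the problem the mappings address.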

Freetext to controlled vocabularies mapping

• Mapping freetext values to controlled values (all metadata held in SQL tables)

• Same principles for all:
  • Obtain dump of metadata and manipulate in Excel
  • Identify CV to be used
  • Use Google Refine to identify existing, similar, freetext entries
  • Re-export into Excel and apply mapping (at item level or, if possible, at value level)
  • CVs to be used in the future

• So far, it has taken 2 staff members, working c. 0.4 FTE, 4 months to clean 3 elements
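
The step of finding similar freetext entries can be sketched with a key-collision fingerprint, one of the clustering methods Google Refine offers. This is a simplified stand-in written for illustration, not the Archive's actual procedure, and the sample values are invented.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Normalise a freetext value so that near-duplicates share one key."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values whose fingerprints collide; keep only real clusters."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


# Invented freetext entries of the kind a long-lived catalogue might hold.
freetext = ["Local authorities", "local authorities.", "Authorities, local",
            "Wards", "Postcode sectors"]

for key, variants in cluster(freetext).items():
    print(key, "->", variants)
```

Each cluster is a candidate for a single value-level mapping; values that fall outside any cluster still need checking at item level.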

The mappings

• Spatial unit <geogUnit>
  • Previous Archive project, U.Geo, had created a spatial unit CV
  • 653 unique values, now mapped to 194
  • This has now been used for all items:

The mappings

• Unit of observation <anlyUnit>
  • 183 unique values, now mapped to 11, using DDI CVG recommended list:
    • Individuals
    • Organizations
    • Families/households
    • Housing Units
    • Events/Processes
    • Geographic Units
    • Time Units
    • Text units
    • Groups
    • Objects
    • Other
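
A value-level mapping onto the eleven-term list above might be applied as in this sketch. The freetext keys in `MAPPING` are invented examples; only the controlled values come from the slide.

```python
# Controlled values as listed on the slide.
ANALYSIS_UNITS = {
    "Individuals", "Organizations", "Families/households", "Housing Units",
    "Events/Processes", "Geographic Units", "Time Units", "Text units",
    "Groups", "Objects", "Other",
}

# Hypothetical value-level mapping from legacy freetext to controlled terms.
MAPPING = {
    "individual": "Individuals",
    "persons": "Individuals",
    "households": "Families/households",
    "firms": "Organizations",
}


def map_value(freetext: str):
    """Return the controlled term for a freetext value, or None if unmapped."""
    term = MAPPING.get(freetext.strip().lower())
    if term is not None and term not in ANALYSIS_UNITS:
        raise ValueError(f"{term!r} is not in the controlled list")
    return term


legacy = ["Persons", " households ", "Schools"]
mapped = {value: map_value(value) for value in legacy}
unmapped = [value for value, term in mapped.items() if term is None]
print(mapped)
print("needs item-level review:", unmapped)
```

Values that the mapping does not cover are reported rather than guessed, so they can be resolved at item level.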

The mappings

• Kind of data <dataKind>
  • 294 unique values, now mapped to 7:
    • Alpha-numeric
    • Audio
    • GIS
    • Image
    • Numeric
    • Textual
    • Video
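
The three cleaned elements (<geogUnit>, <anlyUnit> and <dataKind>) sit together in the summary description of a DDI Codebook record. The sketch below builds such a fragment with Python's standard library; it assumes the usual DDI-C nesting (`codeBook` / `stdyDscr` / `stdyInfo` / `sumDscr`), omits namespaces and every other element a full record would need, and uses illustrative values.

```python
import xml.etree.ElementTree as ET


def build_summary(geog_unit: str, anly_unit: str, data_kind: str) -> ET.Element:
    """Build a minimal codeBook tree holding the three controlled elements."""
    code_book = ET.Element("codeBook")
    stdy_dscr = ET.SubElement(code_book, "stdyDscr")
    stdy_info = ET.SubElement(stdy_dscr, "stdyInfo")
    sum_dscr = ET.SubElement(stdy_info, "sumDscr")
    # Element names as given on the slides; values are illustrative.
    ET.SubElement(sum_dscr, "geogUnit").text = geog_unit
    ET.SubElement(sum_dscr, "anlyUnit").text = anly_unit
    ET.SubElement(sum_dscr, "dataKind").text = data_kind
    return code_book


record = build_summary("Local authorities", "Individuals", "Numeric")
print(ET.tostring(record, encoding="unicode"))
```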

The mappings

• More to come…
  • Method of data collection
  • Access/restrictions (Secure data; standard access conditions etc.)
  • Method of access (Explore online or download)

• Faceted search/browse will be released as a beta in late 2012
  • More development will occur during its beta phase following user feedback

2. Metadata schema: DDI-C to DDI-L

• Simultaneously, the Archive has been preparing for the move from DDI-C to DDI-L
  • DDI-C is similar to a traditional metadata schema
  • DDI-L is more flexible – to the benefit of users:
    • permits data as well as metadata to be encoded
    • captures survey lifecycles
    • gives users a fully-rounded view of a survey from inception to results
    • broad and flexible, allowing groupings to be made – re-use is key

• to support all this, it requires CVs to be used in several elements (the DDI Alliance Controlled Vocabularies Group is working on these)

3. CVs for organisational efficiency: SKOS application

• JISC project: SKOS-HASSET

  • 8 months (June 2012 – January 2013)
  • part of the JISC Research Tools Programme
  • Multi-disciplinary project team:
    • Information Scientists, Data/text Mining Programmer, Linguist, RDF specialist, Developers

SKOS-HASSET

• three aims:
  • apply SKOS to HASSET – making the thesaurus more flexible
  • improve its online presence
  • test its automated indexing capabilities; corpora:
    • questions
    • questionnaires
    • abstracts
    • publications
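
Applying SKOS means expressing each thesaurus term as a `skos:Concept` with labelled relationships. The sketch below writes Turtle by hand for one invented term; the base URI and the `to_turtle` and `slug` helpers are assumptions made for the example, and a real project would more likely use an RDF library.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/thesaurus/"  # hypothetical base URI


def slug(label: str) -> str:
    """Turn a term label into a simple URI fragment."""
    return label.strip().lower().replace(" ", "-")


def to_turtle(label, broader=(), narrower=(), related=()):
    """Serialise one term as a skos:Concept in Turtle."""
    lines = [f"<{BASE}{slug(label)}> a skos:Concept ;",
             f'    skos:prefLabel "{label}"@en']
    for prop, targets in (("broader", broader), ("narrower", narrower),
                          ("related", related)):
        for target in targets:
            lines[-1] += " ;"  # more statements follow for this subject
            lines.append(f"    skos:{prop} <{BASE}{slug(target)}>")
    lines[-1] += " ."  # close the final statement
    return "\n".join(lines)


print(f"@prefix skos: <{SKOS_NS}> .\n")
print(to_turtle("SCHOOLS", broader=["EDUCATION"], related=["TEACHERS"]))
```

The BT, NT and RT links of a traditional thesaurus correspond to `skos:broader`, `skos:narrower` and `skos:related`. The helper does not escape quotation marks in labels, so it is only suitable for simple demonstration data.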

SKOS-HASSET

• Progress so far:
  • SKOS has been applied to HASSET
  • Texts prepared for the automated indexing case study
  • Gold standard of manual indexing of questions is taking place
  • TF/IDF, KEA and WEKA all being used for term extraction – work underway
• Next steps:
  • SKOS product licensing
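
Of the techniques named for term extraction, TF/IDF is simple enough to sketch directly. The example scores words in invented survey questions and keeps only those that match an invented term list; it is a bare illustration of the weighting, not the project's pipeline, and KEA and WEKA are not shown.

```python
import math
import re
from collections import Counter

# Invented question texts and candidate thesaurus terms.
questions = [
    "How many hours of paid employment did you do last week?",
    "Are you currently in paid employment or in education?",
    "How satisfied are you with your housing?",
]
TERMS = {"employment", "education", "housing"}


def tokenize(text):
    """Lower-case a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {token: score} dict per document."""
    tokenised = [tokenize(doc) for doc in docs]
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    scores = []
    for toks in tokenised:
        counts = Counter(toks)
        scores.append({
            tok: (n / len(toks)) * math.log(len(docs) / doc_freq[tok])
            for tok, n in counts.items()
        })
    return scores


for text, weights in zip(questions, tf_idf(questions)):
    suggested = sorted((t for t in weights if t in TERMS),
                       key=weights.get, reverse=True)
    print(suggested, "<-", text)
```

A word that occurs in every question, such as "you", scores zero, while a word confined to one question scores highest; suggestions ranked this way could then be compared against a manually indexed gold standard.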

SKOS-HASSET

• Communication:
  • SKOS-HASSET blog: http://hassetukda.wordpress.com/
  • [email protected] email list
  • Project web site: http://www.data-archive.ac.uk/find/our-projects/skos-hasset
  • Webinar planned for the winter
  • User guidance

• Please contribute, give feedback!

Developments… to issues … to improvements

• For users:
  • the faceted search/browse interface exposed a lack of standardisation in the underlying metadata
    • …freetext terms have been used over 45 years; these are now being standardised
    • …rich freetext metadata has not been lost
  • the move from DDI-C (DDI 2.1) to DDI-L (DDI 3.1) brings in a conceptually different type of schema to the users’ benefit…
    • …but which also requires more controlled vocabularies

Developments… to issues … to improvements

• For us:
  • Applying more CVs will provide efficiencies:
    • …the Archive wants to introduce an online deposit form for its depositors which will include CV dropdowns
    • …create more ways of suggesting terms for the cataloguers
  • SKOS gives the opportunity to work more flexibly with the thesaurus
    • …automated indexing using CVs is being tested
    • …SKOS will allow for easier future thesaurus development

The future: analysis and reporting enhanced

[Diagram of the flow of metadata, with Management Information drawn from it; its labels read:]

• Web deposit form captures more and more controlled metadata from depositors
• Input programs automatically generate SN user guides and title pages
• Additional metadata created through text mining; geographic coordinates
• Manual metadata created, auto metadata checked; record completed with descriptors
• Metadata record
• User queries database
• Other, related terms automatically searched
• Metadata results returned
• ‘just-in-time’ and ‘similar’ results returned
• User questioned about usefulness of results
• Results of user quality evaluation of search analysed
• Evaluation of metadata systems
• Search and browse activity monitored to inform data acquisition
• Analysis and reporting and future acquisitions decisions supported

Conclusion

• we all NEED metadata so that we can find stuff
• there is too much stuff (or not enough bodies) to create all the metadata ourselves in time these days
• searchers/users often expect the applications to do the work for them
• use the tools at our disposal to make this happen by:
  • employing more CVs where appropriate
  • sharing and using RDF-enabled CVs
  • and, crucially, continuing the creation of quality-assured metadata using fewer resources

Conclusion

• JISC Intrallect report; quotation from Vic Lyte:

• “A new researcher wishing to approach scholarly inquiry to determine the impact of global warming on penguin populations in South Antarctica doesn’t walk up to a Librarian and shout ‘Penguins!’.”

(Duncan, C. & Douglas, P. (2009). Automatic metadata generation: use cases and tools/priorities. Intrallect (for JISC).)

CONTACT

UK DATA ARCHIVE
UNIVERSITY OF ESSEX
WIVENHOE PARK
COLCHESTER
ESSEX CO4 3SQ

T +44 (0)1206 872001
E [email protected]