Metadata Issues for e-Prints: experiences from setting up an Institutional Repository Jessie Hey Research Fellow TARDis Project University of Southampton ePrints UK Workshop Ashmolean Museum Oxford 22 Mar 2004


TRANSCRIPT

Page 1: [PowerPoint]

Metadata Issues for e-Prints: experiences from setting up an Institutional Repository

Jessie Hey
Research Fellow, TARDis Project
University of Southampton

ePrints UK Workshop
Ashmolean Museum, Oxford
22 Mar 2004

Page 2: [PowerPoint]

e-Prints

A simple illustration of diversity in metadata!

• EPrints (software)
• e-Prints (Soton)
• ePrints (UK project)
• eprints (in URLs, emails)
• E-print (Network – US gateway)
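A search service can reduce the effect of this kind of spelling diversity by folding the variants to one form before matching. The sketch below is illustrative only: the function name `normalise_eprint_term` and the choice of `eprint` as the folded form are assumptions, not part of any of the systems named above.

```python
import re


def normalise_eprint_term(text: str) -> str:
    """Fold spelling variants such as 'EPrints', 'e-Prints' and 'E-print' to 'eprint'."""
    # Optional hyphen or space after the leading "e", optional plural "s",
    # matched case-insensitively on word boundaries.
    return re.sub(r"\be[- ]?prints?\b", "eprint", text, flags=re.IGNORECASE)


# The five spellings listed on the slide.
variants = ["EPrints", "e-Prints", "ePrints", "eprints", "E-print"]
folded = {normalise_eprint_term(v) for v in variants}
print(folded)
```

All five spellings collapse to a single search term, so a query in any one form would match records written in the others.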

Page 3: [PowerPoint]

Searching for e-Prints in Google

e-Prints 1,200,000; eprints 225,000

Page 4: [PowerPoint]

Plam pilot?

• Looking for a PDA?

• Just try searching for plam pilot on eBay

• Even a sale is not incentive enough

Page 5: [PowerPoint]

Metadata

• The modern word for ‘Data about data’

• Generally structured data describing an e-Print in this context

• Describing an object such as a journal article or book chapter or thesis

Page 6: [PowerPoint]

Metadata issues for today

• Who needs the quality?
• What kind of quality?

• How we approached it in TARDis
– the depositor
– the process
– classification
– mediation

• Balancing demands the pragmatic way

Page 7: [PowerPoint]

Who needs the quality?

Service providers (i.e. search services)

• Analysis in both e-learning and e-prints communities showed concern about quality of metadata in individual databases to give good search results when combined in cross-domain search services

Barton, Jane, Currier, Sarah and Hey, Jessie M.N. (2003) Building quality assurance into metadata creation: an analysis based on the learning objects and e-Prints communities of practice. In: 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice - Metadata Research and Applications, DCMI, 39-48. http://eprints.soton.ac.uk/archive/00000020/

Page 8: [PowerPoint]

As I am in Oxford…

• a tribute in Elvish to JRR Tolkien from the Lord of the Rings

Page 9: [PowerPoint]

Gandalf on Dublin Core metadata

• ‘I cannot read the fiery letters,’ said Frodo in a quavering voice.

• ‘No’ said Gandalf ‘but I can. ……this in the Common Tongue is what is said, close enough:

• One Ring to rule them all, One Ring to find them,

• One Ring to bring them all and in the darkness bind them.’

Page 10: [PowerPoint]

Standards for e-Prints: Dublin Core Metadata Sets

• Define minimal metadata elements for simple resource discovery

e.g. title, creator, subject and keywords, publisher, date, rights management

• Fundamental building blocks for Open Archives Initiative (OAI) compliant repositories

• Software such as GNU EPrints is OAI compliant (in DSpace may need ‘switching on’)

• Full text searching (in latest version) will give additional help to compensate for weaknesses
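As a rough illustration of what a minimal Dublin Core description looks like, the sketch below serialises a few elements for the Barton, Currier and Hey paper cited earlier in this talk. The `record` wrapper element and the `to_simple_dc` helper are hypothetical names chosen for this example; only the Dublin Core element namespace is taken from the standard, and a real OAI-compliant repository would wrap the elements differently.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_simple_dc(fields: dict[str, list[str]]) -> ET.Element:
    """Build one XML element per Dublin Core value (elements may repeat)."""
    record = ET.Element("record")  # hypothetical wrapper, for illustration only
    for name, values in fields.items():
        for value in values:
            ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


paper = {
    "title": [
        "Building quality assurance into metadata creation: an analysis based "
        "on the learning objects and e-Prints communities of practice"
    ],
    "creator": ["Barton, Jane", "Currier, Sarah", "Hey, Jessie M.N."],
    "date": ["2003"],
    "identifier": ["http://eprints.soton.ac.uk/archive/00000020/"],
}

xml_text = ET.tostring(to_simple_dc(paper), encoding="unicode")
print(xml_text)
```

Each author becomes a separate `dc:creator` element, which is one reason consistent name forms matter when records are harvested and combined.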

Page 11: [PowerPoint]

Who needs the quality?

• Academics (the depositors) need reasonable quality for their publication record whether full text is available or not
– Tendency to think a good citation matters less if access leads straight to the full text

An institutional repository needs
• To represent their own work well
• To represent their faculty and university well

• For publicity and communication
• For research assessment and proposals
• For promotion

Page 12: [PowerPoint]

What kind of quality?

• Fit for purpose – visibility and citability

• Rolls-Royce or Volkswagen Golf or a Skoda?

• The Rolls-Royce may not produce a sustainable repository

• Library of Congress had to think again with a backlog of millions

• A departmental archive had to scrap its editors (too slow)

• Need a model with a light touch

Page 13: [PowerPoint]

Examples to correct

From an academic’s current departmental publication record:

• Co-author given as Fadden on older references

• Given as McFadden on newer ones

• McFadden would not find all his papers!

Page 14: [PowerPoint]

Examples to correct

• Authors are not perfect but neither are information specialists or other sources

Recent examples:

• Author’s assistant put a conference in year 2400

• ‘Web of Knowledge’ put a conference in 2010

NB Amazon proved useful for checking book information from the title page (new Amazon ‘search inside’ service) but main entries may be less accurate

Page 15: [PowerPoint]

Quality Assurance Procedures

• Would like to pick up these and obvious examples of metadata in the wrong field, e.g. book title used for title of chapter

• Options include regular checking (e.g. at or close to time of deposit or for annual reporting) or random checking

• Visualisation techniques promising but still expensive
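Errors like the conference years 2400 and 2010 mentioned earlier are the sort of thing a cheap automated check at deposit time could flag for a human to review. The sketch below is one possible form of such a check; the function name `year_is_suspicious` and the thresholds (nothing before 1900, nothing more than two years ahead) are assumptions made for this example, not TARDis practice.

```python
from datetime import date


def year_is_suspicious(
    year: int, today: date, earliest: int = 1900, max_ahead: int = 2
) -> bool:
    """Return True when a publication or conference year looks implausible."""
    # Thresholds are illustrative assumptions; a repository would tune them.
    return year < earliest or year > today.year + max_ahead


# Checked as of the date of this talk.
talk_date = date(2004, 3, 22)
years = [2003, 2010, 2400]
flagged = [y for y in years if year_is_suspicious(y, talk_date)]
print(flagged)
```

A flag here only prompts a second look; a conference announced well in advance could legitimately carry a future year.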

Page 16: [PowerPoint]

How we approached it in TARDis

• Looked at process from point of view of depositor
– to decrease the barriers to deposit
– to improve quality by design or example

• Looked at metadata required for a good citation
– academics using e-print records for many purposes, not just visibility

• Some information may be easier to strip out if required but harder to add later, e.g.
– first name or initials – although cultural variations too
– journal title or abbreviation

Page 17: [PowerPoint]

Simple things deter

• Questions you can’t answer
• No place to put it
• Errors which force you to enter it again

• On a credit card payment
– Date on the card: 06/05
– Date to enter: 06/2005

How many times do I do this incorrectly!
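One way to remove this kind of deterrent is for the form to accept the value as the user sees it. The sketch below accepts both `06/05` and `06/2005`; the function name `parse_expiry` is hypothetical, and mapping two-digit years to the 2000s is an assumption made for the example.

```python
import re


def parse_expiry(text: str) -> tuple[int, int]:
    """Parse 'MM/YY' or 'MM/YYYY' into (month, four-digit year)."""
    match = re.fullmatch(r"\s*(\d{1,2})\s*/\s*(\d{2}|\d{4})\s*", text)
    if match is None:
        raise ValueError(f"unrecognised date: {text!r}")
    month, year = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    if year < 100:
        year += 2000  # assumption: two-digit years belong to the 2000s
    return month, year


as_printed = parse_expiry("06/05")
as_requested = parse_expiry("06/2005")
print(as_printed, as_requested)
```

The same idea applies to deposit forms: accepting a date or page range in the form printed on the paper avoids forcing the depositor to re-enter it.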

Page 18: [PowerPoint]

To help the depositor

• Aimed to enter information as the depositor sees it on the full text

• Arranged input in the order the information is seen

• With relevant information grouped together

• With ‘pages’ of daunting size
• Fields of a size to view as much of the text as possible

Page 19: [PowerPoint]

TARDis - Aiding deposit – relevant fields – relevant help

Page 20: [PowerPoint]

The Process

• Added help where examples are useful
• Added extra buttons at top to ease navigation
• Made mandatory fields where essential
• Tension between full details and deterrent
– commentary field currently not included although some might find useful

Page 21: [PowerPoint]

Some ‘quality’ traditions may be less practical

• Search service recommendations: capitals only for first word of title except proper nouns

• Process is generally ‘cut and paste’ so result is variable and advice ignored

• Get Caps, non-caps, rarely ALL CAPS

• Found in practice likely to be too time consuming to insist

• Think retrieval first rather than consistency
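"Retrieval first" can be approximated by leaving the stored title exactly as deposited and comparing titles through a case-insensitive key at search time. The sketch below shows the idea; `title_key` is a hypothetical helper, not a feature of any repository software named in this talk.

```python
def title_key(title: str) -> str:
    """Return a comparison key that ignores capitalisation and extra spaces."""
    # casefold() is a more aggressive lower(); split/join collapses whitespace.
    return " ".join(title.casefold().split())


# The same title as it might arrive from three cut-and-paste deposits.
deposited = [
    "Building Quality Assurance into Metadata Creation",
    "Building quality assurance into metadata creation",
    "BUILDING  QUALITY ASSURANCE INTO METADATA CREATION",
]
keys = {title_key(t) for t in deposited}
print(keys)
```

All three forms yield one key, so a search finds the record whichever capitalisation the depositor pasted in.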

Page 22: [PowerPoint]

Classification – a specific area of debate

• ePrints UK exploring automatic classification with Dewey

• TARDis looked at current practice:
– Reviewed subject classification in discipline-based and early institutional archives
– Found whole variety of choices and levels of complexity

Page 23: [PowerPoint]

TARDis on subject classification

• Discussion of issues and snapshot chart http://tardis.eprints.org

• Using basic Library of Congress with view to harvesting, e.g. papers in Oceanography

• Added search box to find subject
• Departments could use an additional scheme if they wish (software option)
• Keywords can be added (cut and paste) if available (sometimes papers also have classification categories added for a journal)

• Computer classification generally expensive and requires learning examples but accuracy is improving

Page 24: [PowerPoint]

Towards the future – subject classification – on the fly

Page 25: [PowerPoint]

Mediation

• TARDis is experimenting with deposit choices

• Branch to:

– Self archiving (author or local assistant) with light review as pass through submission buffer

– Assisted archiving – give us the file with essential details not evident from the full text

Page 26: [PowerPoint]

Mediation in practice

• Current experience:

– Assisted archiving often time consuming – meeting the difficult ones – but can add value (e.g. fuller publisher location details such as DOI)

– Self archiving less accurate but author may know details which may be missing from full text

– Balance likely to change as authors become either more familiar with early deposit or perhaps happy to delegate to save time

– Learning curve for us – later may devolve some quality responsibility (use editorial options)

– Give additional feedback into software

Page 27: [PowerPoint]

The challenge of cutting and pasting from PDFs

• Sometimes rather like the Hyperbookworms (Jasper Fforde, The Eyre Affair)

• Who produce spurious capitals, apostrophes, hyphens

• Problems with hyphens, accents and words starting with f!

• LaTeX usually the culprit so Humanities have an advantage here
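One common cause of trouble with words starting with f is that PDFs often store "fi" and "fl" as single ligature characters, and accents as separate combining marks; whether that is the exact problem met in TARDis is an assumption here. The sketch below uses Unicode normalisation from the Python standard library to repair such text; `clean_pasted_text` is a hypothetical helper name.

```python
import unicodedata


def clean_pasted_text(text: str) -> str:
    """Repair common debris in text pasted from a PDF."""
    # NFKC expands ligatures (U+FB01 -> "fi") and composes letter + accent.
    text = unicodedata.normalize("NFKC", text)
    # Remove soft hyphens (U+00AD), which mark optional line-break points.
    return text.replace("\u00ad", "")


pasted = "Meta\u00addata \ufb01elds for Soci\u0065\u0301t\u0065\u0301 work\ufb02ow"
cleaned = clean_pasted_text(pasted)
print(cleaned)
```

This does not fix spurious capitals or apostrophes, which still need a human eye.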

Page 28: [PowerPoint]

Balancing demands the pragmatic way

• Author deposit changes the equation
• Incentives can increase accuracy
– Deposit support
– Requests by department or university or funding council for up to date records

• Collaboration between author, department and information specialist may be best way forward

• Aim: light quality control to achieve visibility and citability

Page 29: [PowerPoint]

The New World of e-Prints

• Not so elegant to work in as an Oxford College Library such as Brasenose

• But should be just as satisfying to use as it meets new needs

Page 30: [PowerPoint]

Thank you

For further information:

TARDis http://tardis.eprints.org/

e-Prints Soton (Research Soton) http://eprints.soton.ac.uk/

FAIR Focus on Access to Institutional Resources Programme

"Improving the Quality of Metadata in Eprint Archives" Marieke Guy and Andy Powell Ariadne Issue 38 30-January-2004

Barton, Jane, Currier, Sarah and Hey, Jessie M.N. (2003) Building quality assurance into metadata creation: an analysis based on the learning objects and e-Prints communities of practice. In: 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice - Metadata Research and Applications, DCMI, 39-48. http://eprints.soton.ac.uk/archive/00000020/