Digitisation at Scale: Automating the mass acquisition of digitised content
IS&T Archiving Conference, Washington, April 2016
Dave Thompson, Digital Curator, Wellcome Library
The Wellcome Library
• Part of Wellcome Collection, an astonishing public venue in London developed by the Wellcome Trust, where people can learn more about medicine through the ages & across cultures
• Five-year plan for transforming the Wellcome Library.
Driver for digitisation
• To make our collections available to anyone, anywhere, we are digitising as much of our physical collection as we can, for both our website and the websites of other organisations. We are also digitising and hosting collections from partners that complement our holdings
Transforming the Wellcome Library: 2009-2014. http://wellcomelibrary.org/what-we-do/library-strategy-and-policy/transforming-the-wellcome-library/
The problem
• How to scale systems & processes to deliver on our ambition
• How to design & build new high-volume systems & processes for: acquisition, storage, processing, access
• How to manage volumes of data during creation/acquisition
Process design – sources of content
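[Diagram: routes by which content reaches Goobi (METS/OCR) and Preservica. Labels recovered from the figure:]
• Sources: In-house, Institutions, Contractors, Harvesting, Grey literature
• Delivery formats: TIFF or JP2, in one case delivered by HD & ftp
• Routes: Manual, Automatic
• Processing steps: Normalises TIFF to JP2; Jpylyzer validates JP2; Auto harvesting of JP2 & DMD
• People and checks: Ingest Officer / Digital Curator; Snagging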
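The format handling named in the diagram (TIFF normalised to JP2, JP2 validated by Jpylyzer) implies a routing decision when a file arrives. A minimal sketch of such a pre-check, assuming files are routed on their leading signature bytes. The function names and route labels are hypothetical, and a signature check is only a first filter; it does not replace Jpylyzer's full validation.

```python
from pathlib import Path

# Leading bytes defined by the file formats themselves.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")  # JP2 signature box
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")  # little- and big-endian TIFF


def sniff_format(header: bytes) -> str:
    """Return 'jp2', 'tiff' or 'unknown' from a file's first bytes."""
    if header.startswith(JP2_SIGNATURE):
        return "jp2"
    if header.startswith(TIFF_SIGNATURES):
        return "tiff"
    return "unknown"


def route(path: Path) -> str:
    """Pick a next step for a delivered file (route labels are hypothetical)."""
    with path.open("rb") as handle:
        kind = sniff_format(handle.read(12))
    routes = {
        "tiff": "normalise-to-jp2",
        "jp2": "validate-with-jpylyzer",
    }
    # Anything unrecognised goes to a person rather than into the pipeline.
    return routes.get(kind, "snagging")
```

Sending unrecognised files to snagging rather than failing the whole batch keeps the automatic route moving while a person looks at the exceptions.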
The approach
• (Re)use/develop existing systems where possible, e.g. bibliographic system Sierra, Preservica EE repository
• Identify where new systems would be required, e.g. workflow middleware
• Take a practical approach & accept that it would be iterative, learning as we go
The solution was to use Goobi
Why Goobi?
• Dedicated to digitisation
• Flexibility & process control
• Adaptable & scalable
• Vendor expertise/support
Role of Goobi
• Role of Goobi is overall management & tracking of processes
• Initiate ingest into our DAM, Preservica
• Reporting & statistics
Role of humans
• Working at volume did not imply more staff, it implied efficiency
• Also implied automation
• Human work was focussed on tasks machines couldn't do
System & process design
• High volume doesn’t imply use of many systems
• Requires design to be as simple as possible, with as few moving parts as possible
• Processes need to be efficient & scalable, human as well as system
Partnership for scalable digitisation
• Relationship with Internet Archive, which is digitising our Library content
• High-volume, long-term project
• Content harvested from Internet Archive website & processed automatically
• Dedicated Goobi process for fully automated harvesting
Harvesting from Internet Archive
• Goobi has a ‘repository’ of IA identifiers for searching/harvesting.
• Goobi harvests data from Internet Archive website.
• Content processed automatically, including creation of METS & ALTO.
• Content stored in Preservica.
• DDS creates JSON for the player & pre-caches some content.
• Content available in the player.
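The identifier-driven harvest described here can be sketched against the Internet Archive's public metadata endpoint (`https://archive.org/metadata/<identifier>`) and its download URLs (`https://archive.org/download/<identifier>/<file>`). This is an assumption about one way to harvest, not a description of the Goobi process. The helper names and the suffix filter are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{name}"


def fetch_metadata(identifier: str) -> dict:
    """Fetch an item's metadata record (makes a network call)."""
    url = METADATA_URL.format(identifier=quote(identifier))
    with urlopen(url) as response:
        return json.load(response)


def files_to_harvest(metadata: dict, suffixes=(".jp2", "_jp2.zip", ".xml")) -> list:
    """List file names worth harvesting; the suffix filter is an assumption."""
    names = [entry["name"] for entry in metadata.get("files", [])]
    return sorted(name for name in names if name.lower().endswith(suffixes))


def download_urls(identifier: str, names: list) -> list:
    """Build one download URL per selected file."""
    return [
        DOWNLOAD_URL.format(identifier=quote(identifier), name=quote(name))
        for name in names
    ]
```

Keeping the selection logic separate from the network call means it can be tested against a saved metadata record, which helps when a change on the remote site alters what comes back.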
Challenges - M&Ms
• Multi-volume works
• No metadata to support their union
• Have to construct them manually, but process can be simplified
• Time consuming, still to be fully automated
Challenges – Working with partners
• Changes to Internet Archive website broke our harvesting
• For automated ftp to work, 3rd parties need to follow instructions
• Creation of JPEG2000 images/video
• Incorrect identifiers trip up processes
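One way to catch incorrect identifiers before they trip up a process is to compare what a partner delivered against what was expected. A minimal sketch, assuming each delivery is a set of package names that should match a list of expected identifiers. The function name and the sample identifiers in the usage note are hypothetical.

```python
def check_delivery(expected_ids, delivered_names) -> dict:
    """Compare delivered package names with the identifiers we expected."""
    expected = set(expected_ids)
    delivered = set(delivered_names)
    return {
        "matched": sorted(expected & delivered),
        "missing": sorted(expected - delivered),
        # Unexpected names are often mistyped identifiers.
        "unexpected": sorted(delivered - expected),
    }
```

For example, `check_delivery(["item-001", "item-002"], ["item-001", "item-0O2"])` reports `item-002` as missing and `item-0O2` as unexpected, so a person can fix the name before automated processing starts.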
Opportunities
• Working with IT gives us the flexibility of a virtualised environment
• Working with Intranda brings in vendor expertise
• Distributed system brings in feedback from many users
• Small team simplifies decision making
• Success leads to success
Life cycle management
• We are in a good place with regard to life cycle management
• Consistent processes based on common workflows
• Goobi outputs are consistent & predictable
• A unified data set is easier to manage in the future
Has automation been successful?
• Yes, with a ‘but’
• Automation can be complex, easy to make mistakes
• Automation requires metadata to be available
• Automated processes still require a human minder
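The points about metadata and the human minder can be illustrated with a small runner that stops an item at its first failing step and records it for review instead of halting the batch. This is a sketch under stated assumptions: items are plain dicts with an `id`, and the names `RunReport`, `run_pipeline` and `require_title` are hypothetical, not part of Goobi.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class RunReport:
    completed: List[str] = field(default_factory=list)
    # Each snag records (item id, step name, reason) for a person to review.
    snagged: List[Tuple[str, str, str]] = field(default_factory=list)


def require_title(item: Dict) -> None:
    """Example step: automation needs metadata to be available."""
    if not item.get("title"):
        raise ValueError("missing title")


def run_pipeline(items: List[Dict], steps: List[Tuple[str, Callable]]) -> RunReport:
    """Run every step on every item; failed items are snagged, not fatal."""
    report = RunReport()
    for item in items:
        for step_name, step in steps:
            try:
                step(item)
            except Exception as error:
                report.snagged.append((item["id"], step_name, str(error)))
                break
        else:
            # Reached only when no step raised for this item.
            report.completed.append(item["id"])
    return report
```

The snag list is what the human minder works from: the batch completes, and only the exceptions need attention.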
The scale of things
Lessons learned
• Complexity vs simplicity
• Iterative approaches work but are time consuming
• Vendor support/input crucial when starting from scratch
• Process design essential
Be bold. Sometimes it’s the way we work that has to change
Thank you
Questions now, questions later…?
Dave Thompson, Digital Curator, Wellcome Library
[email protected] @d_n_t
http://wellcomelibrary.org/