Dr. Ross King
AIT Austrian Institute of Technology GmbH
SCAPE/OPF Executive Seminar: Managing Digital Preservation
The Hague, April 2, 2014
SCAPE Tools and Solutions
Outline
• SCAPE Project
• SCAPE Tools
• SCAPE Solutions
• SCAPE and Preservation Management
• SCAPE Additional Information
  • Online Resources
  • Events
  • Contact Information
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
SCAPE – what is it about?
• Planning and executing computing-intensive digital preservation processes such as the large-scale ingestion, characterisation or migration of large (multi-Terabyte) and complex data sets
• SCAPE results include
  • Preservation scenarios
  • Preservation tools
  • Preservation workflows
  • Preservation infrastructure
  • Preservation best-practices
SCAPE is a follow-up to the highly successful FP6 IP Planets.
SCAPE Project Data
• Project instrument: FP7 Collaborative Project
• 20 Partners from 11 countries
• 6th Call
  • Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
  • Target outcome (a) Scalable systems and services for preserving digital content
• 10th Call
  • Objective ICT-2013.11.4: Supplements to Strengthen Cooperation in ICT R&D in an Enlarged European Union
• Duration: 44 months
  • February 2011 – September 2014
• Budget: 12.0 Million Euro
  • Funded: 9.2 Million Euro
SCAPE Consortium
SCAPE Tools
Scalable Tools
• Toolwrapper
  • Application that adapts existing tools to the SCAPE Platform
  • https://github.com/openplanets/scape-toolwrapper
  • Enhances wrapped tools
    • Standard naming scheme for CC, AS and QA tools
    • Standard invocation method (CLI)
    • Debian packages for easy deployment on the cluster
    • Support for data streaming (useful for Hadoop jobs)
  • Generates Preservation Components
    • Taverna workflows with embedded metadata for easy discovery
    • Automatic publication of components on myExperiment (to support discoverability)
    • Standard ports to enable composition of Preservation Components (based on well-defined component profiles, CC, AS & QA)
• Digital Preservation Toolkit
  • Software suite that contains a large set of DP tools
  • 77 operations in total
  • Easy to deploy on Linux machines (via apt-get)
    • apt-get install digital-preservation-tools
Scalable Tools
• Jpylyzer
  • JP2 (JPEG 2000 Part 1) validator and properties extractor
  • http://openplanets.github.io/jpylyzer/
• Pagelyzer
  • Suite of tools for detecting changes in web pages and their rendering
  • http://openplanets.github.io/pagelyzer/
• xcorrSound
  • Suite of tools for automated quality assurance of audio migration processes
  • https://github.com/openplanets/scape-xcorrsound
• Matchbox
  • Duplicate image detection tool
  • http://openplanets.github.io/matchbox/
• ToMaR
  • Supports the scalable execution of preinstalled tools or other applications
  • Wraps command-line invocation of a tool into a MapReduce program
  • https://github.com/openplanets/scape/tree/master/pt-mapred
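The idea behind the ToMaR bullet, wrapping a command-line invocation so that it can run once per input record and then aggregating the results, can be sketched in a few lines of Python. This is only an illustration of the pattern on a single machine. It is not ToMaR's API and it does not use Hadoop. The `{input}` placeholder, the function names and the argument-list command format are all assumptions made for this sketch.

```python
import subprocess
import sys
from collections import Counter


def build_command(template, input_path):
    """Fill the hypothetical "{input}" placeholder in an argument list."""
    return [arg.replace("{input}", input_path) for arg in template]


def map_record(template, input_path):
    """Map step: run the wrapped tool on one input and emit one record."""
    result = subprocess.run(
        build_command(template, input_path),
        capture_output=True,
        text=True,
    )
    return input_path, result.returncode, result.stdout.strip()


def reduce_records(records):
    """Reduce step: count successful and failed invocations."""
    return Counter("ok" if code == 0 else "failed" for _, code, _ in records)


if __name__ == "__main__":
    # Stand-in "tool": the Python interpreter echoing its argument in upper case.
    tool = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "{input}"]
    records = [map_record(tool, path) for path in ["a.jp2", "b.jp2"]]
    print(records)
    print(dict(reduce_records(records)))
```

In a real cluster setting the map step would run in parallel on the nodes that hold the data, which is what makes the approach scale.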
Planning and Watch Tools
• SCOUT: an automated preservation watch system
  • Enables planning tools and decision makers to monitor the world and the organisation
  • Collects relevant knowledge and enables automated notification
  • Open and extensible
• c3po: scalable content profiling
  • c3po analyses characterisation data based on FITS
  • Scale-out MongoDB (100k/min/node)
  • Visual drill-down and well-documented profile
  • Automated sample selection
• PLATO 4.4: scalable preservation planning
  • www.ifs.tuwien.ac.at/dp/plato
  • Technology upgrade - refactored, rebuilt, standardised, tested
  • New features
    • Groups allow collaborative planning
    • Integration of control policies for group
    • Quality domain – measures
Repositories
• Fedora 4.0.0
  • All REST, no SOAP
  • RDF as first class objects
  • JCR 2.0 Implementation (ModeShape)
  • Infinispan distributed NoSQL datastore
• RODA
  • KEEP Solutions’ open source repository
  • Implements all SCAPE APIs
• Rosetta
  • Ex Libris’ commercial long-term preservation system
  • Implements SCAPE Data Connector API
SCAPE Architecture
SCAPE Platform
[Architecture diagram. Legend: SCAPE Components; 3rd Party Components; 3rd Party Components with SCAPE contributions. Labelled elements: 1, 2, ..., n; HDFS; Hadoop; Pig; ToMaR; STAGER; LOADER; Fedora 4, Rosetta, RODA; Taverna; Data Connector API; SCAPE APIs; PPL; toolspec; Digital Objects.]
SCAPE Platform + Preservation Components
[Architecture diagram. Legend: SCAPE Components; 3rd Party Components; 3rd Party Components with SCAPE contributions. Labelled elements: 1, 2, ..., n; HDFS; Hadoop; Pig; ToMaR; STAGER; LOADER; Fedora 4, Rosetta, RODA; Taverna; Data Connector API; SCAPE APIs; PPL; toolspec; Tool wrapper; Components; Digital Objects; Preservation Tools.]
SCAPE Planning and Watch
[Architecture diagram. Legend: SCAPE Components; 3rd Party Components; 3rd Party Components with SCAPE contributions. Labelled elements: 1, 2, ..., n; HDFS; Hadoop; Pig; ToMaR; STAGER; LOADER; Fedora 4, Rosetta, RODA; PLATO 4; Taverna; Data Connector API; Report API; Plan Management API; SCOUT; SCAPE APIs; PPL; toolspec; Tool wrapper; Components; Digital Objects; Preservation Tools.]
SCAPE Solutions
See also http://wiki.opf-labs.org/display/SP/SCAPE+Stories
Migration: Large Scale Image Migration
• User Story: As a curator of image files, I need a digital preservation system that can migrate a large number of images from one format to another, ensuring that the migrated images conform to our institutional profile, that no image data is lost and that the migration is cost-effective (saving storage, for example).
• SCAPE Solution
  • SCAPE Platform
  • ImageMagick (with SCAPE toolspec description)
  • Jpylyzer
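Two of the checks named in this user story can be illustrated with a small Python sketch: a first sanity check that a migrated file starts with the JP2 signature box, and a calculation of the storage saved. This is far less than the validation Jpylyzer performs and is not Jpylyzer's interface. The function names are hypothetical and the sketch assumes the migration target is JP2.

```python
# The 12-byte JPEG 2000 signature box that opens every JP2 file:
# box length 12, box type "jP  ", content 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def has_jp2_signature(data: bytes) -> bool:
    """Return True if the byte string starts with the JP2 signature box."""
    return data[:12] == JP2_SIGNATURE


def storage_saving(original_size: int, migrated_size: int) -> float:
    """Fraction of storage saved by migration (negative if the file grew)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 1.0 - migrated_size / original_size
```

A signature match only shows that the file claims to be JP2. Conformance to an institutional profile needs a full validator.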
Migration: Large Scale Audio Migration
• User Story: As the owner of a large audio collection, I need a digital preservation system that can migrate large numbers of audio files from one format to another and ensure that the migration is a good and complete copy of the original.
• SCAPE Solution
  • SCAPE Platform
  • xcorrSound
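One way to test whether a migrated audio file is a good copy is to cross-correlate the decoded samples of the original and the migrated file, which is the technique the name xcorrSound points to. The Python sketch below is a minimal, unoptimised illustration on plain lists of samples. It is not xcorrSound's implementation, and the function names are assumptions made for this example.

```python
import math


def correlation_at(original, migrated, lag):
    """Normalised cross-correlation of the overlapping samples at one lag."""
    pairs = [
        (original[i], migrated[i + lag])
        for i in range(len(original))
        if 0 <= i + lag < len(migrated)
    ]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm = math.sqrt(sum(x * x for x, _ in pairs) * sum(y * y for _, y in pairs))
    return dot / norm if norm else 0.0


def best_offset(original, migrated, max_lag):
    """Return the (lag, score) pair with the highest correlation.

    max_lag should be much smaller than the signal length, otherwise very
    short overlaps can produce misleadingly high scores.
    """
    candidates = (
        (lag, correlation_at(original, migrated, lag))
        for lag in range(-max_lag, max_lag + 1)
    )
    return max(candidates, key=lambda pair: pair[1])
```

A score close to 1.0 at a small lag suggests the migrated signal matches the original apart from a short delay. A low best score suggests that the content differs.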
Analysis: File Format Identification and Characterisation of Web Archives
• User Story: As a web archive, I need a digital preservation system that can process both ARC and WARC files and identify the file formats of, and characterise, the items they contain, so that I can assess preservation risks and plan which tools will be required for access to those formats.
• SCAPE Solution
  • SCAPE Platform
  • ARC Unpacker
  • FITS Tool (with SCAPE toolspec description)
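Once each item in an archive has been identified, the results can be summarised as a format profile that supports risk assessment and tool planning. The Python sketch below assumes the identification step has already produced (item, MIME type) pairs. That record shape and the function name are hypothetical and do not reflect the output format of FITS or any SCAPE tool.

```python
from collections import Counter


def format_profile(records):
    """Summarise (item, mime_type) pairs as (mime_type, count, share) rows.

    Rows are ordered from the most to the least frequent format.
    """
    counts = Counter(mime for _, mime in records)
    total = sum(counts.values())
    return [(mime, n, n / total) for mime, n in counts.most_common()]


if __name__ == "__main__":
    sample = [
        ("index.html", "text/html"),
        ("logo.jpg", "image/jpeg"),
        ("about.html", "text/html"),
        ("report.pdf", "application/pdf"),
    ]
    for mime, count, share in format_profile(sample):
        print(f"{mime}: {count} items ({share:.0%})")
```

Rare formats at the bottom of such a profile are often the ones that need attention first, because access tools for them are less likely to be available.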
Quality Control: Comparison of Web Snapshots
• User Story: In order to be confident that we have preserved a website, we need a digital preservation system that can automate the comparison of two web snapshots, for example a harvested copy and a previously harvested copy that has been manually verified as an accurate representation of the site. This will enable us to ensure that web content has been successfully harvested and to inform harvesting policies.
• SCAPE Solution
  • Pagelyzer
  • Hadoop Platform
Solving Preservation Problems the SCAPE Way
• Open Source Development
  • And/or implementation of open APIs
• Uniform Deployment
  • Use the SCAPE Toolspec + Toolwrapper to publish tools
    • As Advanced Package Tool (APT) packages
    • As SCAPE Components
• Preservation Planning
  • Use PLATO to test tools (as SCAPE Components) and make policy-based plans
• Process Modelling
  • Use Taverna to model preservation workflows
    • Taverna works directly with SCAPE components for experimental workflows
    • Taverna workflows can be converted to Hadoop/Pig workflows in some cases
• Hadoop Deployment
  • Use APT packages to deploy to a Hadoop environment
• Scalable Execution
  • SCAPE ToMaR can directly access tools through the toolspec
from digitalbevaring.dk
SCAPE and Preservation Management
The Wall
Research and Development
• Focus on innovation
• Services are prototypes
  • Unstable
  • Buggy
  • Maintenance pool limited to a few (or one) expert(s)
Production
• Focus on daily business needs
• Service availability is a priority
  • Services are stable
  • Enjoy a large maintenance pool
The Wall
[Diagram labels: Research and Development; Production.]
The Wall
[Diagram labels: Research and Development; Production; 1, 2, ..., n; HDFS; Hadoop; Pig; ToMaR; Fedora, Rosetta, RODA; Digital Objects.]
The Wall
[Diagram labels: 1, 2, ..., n; HDFS; Hadoop; Pig; ToMaR; STAGER; LOADER; Fedora, Rosetta, RODA; Data Connector API; Digital Objects.]
• Other problems with The Wall?
• How can we break through The Wall?
SCAPE Additional Information
Additional Resources of Interest
• Development Infrastructure
  • Code repository hosted by the Open Planets Foundation and GitHub
    • https://github.com/openplanets/scape/
  • Development Wiki
    • http://wiki.opf-labs.org/display/SP/Home
  • Experimental Workflows
    • http://www.myexperiment.org/search?query=SCAPE&type=all&commit=Search
• Publications
  • http://www.scape-project.eu/category/publication
• Public Deliverables
  • http://www.scape-project.eu/category/deliverable
• Tools
  • http://www.scape-project.eu/tools
SCAPE Events
• DL2014: Joint SCAPE/APARSEN Workshop
  • September 8, 2014, London
  • Registration: http://scape-future-formats-first.eventbrite.co.uk/
See http://www.scape-project.eu/events
SCAPE Contact Information
• http://www.scape-project.eu/
• Twitter: #scapeproject
• [email protected]
• Dr. Ross King
  AIT Austrian Institute of Technology GmbH
  Donau-City-Strasse 1
  A-1220 Wien
Thank you for your attention!
Questions?