Scalable Preservation Workflows: design, parallelisation, and execution
Rainer Schmidt, DP Advanced Practitioners Training, July 16th, 2013, University of Glasgow


DESCRIPTION

Rainer Schmidt, AIT Austrian Institute of Technology, presented Scalable Preservation Workflows from SCAPE at the five-day ‘Digital Preservation Advanced Practitioner Training’ event (http://bit.ly/1fYCvMO), hosted by the DPC in Glasgow on 15-19 July 2013. The presentation introduces the SCAPE Platform, presents scenarios from the SCAPE testbeds, and describes how to create scalable workflows and execute them on the SCAPE Platform.

TRANSCRIPT

Page 1: Scalable Preservation Workflows

SCAPE

Rainer Schmidt, DP Advanced Practitioners Training, July 16th, 2013, University of Glasgow

Scalable Preservation Workflows: design, parallelisation, and execution

Page 2: Scalable Preservation Workflows

The Project

• European Commission FP7 Integrated Project
• 16 organizations, 8 countries
• 42 months: February 2011 – July 2014
• Budget: 11.3 million euro (8.6 million euro funded)
• Consortium: data centers, memory institutions, research centers, universities & commercial partners
  • Recently extended to involve HPC computing centers
• Dealing with (digital) preservation processes at scale, such as ingestion, migration, analysis, and monitoring of digital data sets
• Focus on scalability, robustness, and automation

Page 3: Scalable Preservation Workflows

What I will show you

• Example scenarios from the SCAPE DL Testbed and how they are formalized using workflow technology
• An introduction to the SCAPE Platform: underlying technologies, preservation services, and how to set it up
• How the paradigm differs from a client-server set-up, and whether I can execute a standard tool against my data
• How to create scalable workflows and execute them on the platform
• A practical demonstration (and available VM) for creating and running such workflows

Page 4: Scalable Preservation Workflows


Example Scenarios and workflows

Page 5: Scalable Preservation Workflows

Motivation

• Ability to process large and complex data sets in preservation scenarios
• Increasing amount of data in data centers and memory institutions
• Volume, velocity, and variety of data

[Chart: projected growth of data volume, 1970–2030]

cf. Jisc (2012) Activity Data: Delivering benefits from the data deluge. Available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx

Page 6: Scalable Preservation Workflows

Austrian National Library (ONB)

• Web Archiving
  • Scenario 1: Web Archive Mime Type Identification
• Austrian Books Online
  • Scenario 2: Image File Format Migration
  • Scenario 3: Comparison of Book Derivatives
  • Scenario 4: MapReduce in Digitised Book Quality Assurance

Page 7: Scalable Preservation Workflows

Web Archiving – File Format Identification

• Physical storage: 19 TB
• Raw data: 32 TB
• Number of objects: 1,241,650,566
• Domain harvesting: the entire .at top-level domain every 2 years
• Selective harvesting: interesting, frequently changing websites
• Event harvesting: special occasions and events (e.g. elections)

Page 8: Scalable Preservation Workflows

Austrian Books Online

• Public-private partnership with Google Inc.
  • Only public domain material
• Objective: scan ~600,000 volumes (~200 million pages)
• ~70 project team members, 20+ in the core team
• ~130,000 physical volumes scanned so far (~40 million pages)

Page 9: Scalable Preservation Workflows

Google Public-Private Partnership

[Diagram: digitisation → download & storage → quality control (ADOCO) → access]

https://confluence.ucop.edu/display/Curation/PairTree

Page 10: Scalable Preservation Workflows

Image File Format Migration

• Task: image file format migration
  • TIFF to JPEG2000 migration
    • Objective: reduce storage costs by reducing the size of the images
  • JPEG2000 to TIFF migration
    • Objective: mitigation of the JPEG2000 file format obsolescence risk
• Challenges:
  • Integrating validation, migration, and quality assurance
  • Computing-intensive quality assurance

Page 11: Scalable Preservation Workflows

Comparison of Book Derivatives – Matchbox Tool

• Quality assurance for different book versions
  • Images have been manipulated (cropped, rotated) and stored in different locations
  • Images subject to different modification procedures
• Detailed image comparison and detection of near duplicates and corresponding images
  • Feature extraction invariant under color space, scale, rotation, cropping
  • Detecting visual keypoints and structural similarity
• Automated quality assurance workflows
  • Austrian National Library – book scan project
  • The British Library – “Dunhuang” manuscripts

Page 12: Scalable Preservation Workflows

Data Preparation and QA

• Goal: preparing large document collections for data analysis
  • Example: detecting quality issues due to cropping errors
• Large volumes of HTML files generated as part of a book collection
  • Representing layout and text of the corresponding book page
  • HTML tags representing e.g. the width and height of a text or image block
• QA workflow using multiple tools (a sketch of the first step follows below):
  • Generate image metadata using ExifTool
  • Parse HTML and calculate the block size of each book page
  • Normalize the data and put it into a database
  • Execute a query to detect quality issues
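As a rough illustration of the metadata-extraction step, ExifTool can emit the image dimensions that are later compared against the HTML block sizes (the file path follows the examples on the backup slides and stands in for any page image):

  exiftool -s -ImageWidth -ImageHeight /NAS/Z119585409/00000001.jp2   # prints the page image dimensions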

Page 13: Scalable Preservation Workflows


The SCAPE Platform

Page 14: Scalable Preservation Workflows

Goal of the SCAPE Platform

• A hardware and software platform to support scalable preservation in terms of computation and storage
  • Employing a scale-out architecture to support preservation activities against large amounts of data
  • Integration of existing tools, workflows, and data sources and sinks
• A data center service providing a scalable execution and storage backend for different object management systems
  • Based on a minimal set of defined services
  • Processing tools and/or queries close to the data

Page 15: Scalable Preservation Workflows

Underlying Technologies

• The SCAPE Platform is built on top of existing data-intensive computing technologies
  • The reference implementation leverages the Hadoop software stack (HDFS, MapReduce, Hive, …)
• Virtualization and packaging model for dynamic deployment of tools and environments
  • Debian packages and IaaS support
• Repository integration and services
  • Data/Storage Connector API (Fedora and Lily)
  • Object Exchange Format (METS/PREMIS representation)
• Workflow modeling, translation, and provisioning
  • Taverna Workbench and Component Catalogue
  • Workflow Compiler and Job Submission Service

Page 16: Scalable Preservation Workflows

Components of the Platform

• Execution Platform
  • Deploy SCAPE tools and parallelized (WF) applications
  • Executable via CLI and service API
  • Scripts/drivers aiding integration
• Workflow Support
  • Describe and validate preservation workflows using a defined component model
  • Registration and semantic search using the Component Catalogue
• Repository Integration
  • Fedora implementation on top of CI
  • Loader application, object model, and Connector APIs

Page 17: Scalable Preservation Workflows

Architectural Overview (Core)

[Diagram: the Workflow Modeling Environment talks to the Component Catalogue through the Component Lookup API and the Component Registration API]

Page 18: Scalable Preservation Workflows

Architectural Overview (Core)

[Same diagram, with the part covered by this talk highlighted: “Focus of this talk”]

Page 19: Scalable Preservation Workflows


Hadoop Overview

Page 20: Scalable Preservation Workflows

Hadoop Overview #1

• Open-source software framework for large-scale data-intensive computations running on large clusters of commodity hardware
• Derived from Google’s File System and MapReduce publications
• Hadoop = MapReduce + HDFS
  • MapReduce: programming model (map, shuffle/sort, reduce) and execution environment
  • HDFS: virtual distributed file system overlay on top of local file systems

Page 21: Scalable Preservation Workflows

Hadoop Overview #2

• Designed for a write-once, read-many-times access model
  • Data IO is handled via HDFS (see the sketch below)
  • Data is divided into blocks (typically 64 MB) and distributed and replicated over data nodes
• Parallelization logic is strictly separated from the user program
  • Automated data decomposition and communication between processing steps
  • Applications benefit from built-in support for data locality and fail-safety
  • Applications scale out on big clusters, processing very large data volumes
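As a minimal sketch of the block handling described above (file and HDFS paths hypothetical), a file can be loaded into HDFS and its block placement inspected:

  # copy a local file into HDFS; the framework splits it into (typically 64 MB) blocks
  hadoop fs -put archive.warc /user/scape/archive.warc

  # list the file's blocks and the data nodes holding the replicas
  hadoop fsck /user/scape/archive.warc -files -blocks -locations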

Page 22: Scalable Preservation Workflows

MapReduce/Hadoop in a Nutshell

[Diagram: input data is divided into input splits, each holding several records; map tasks process the records in parallel; a sort/shuffle/merge phase groups the intermediate output; reduce tasks write aggregated results to the output data]

Page 23: Scalable Preservation Workflows

MapReduce/Hadoop in a Nutshell

Map takes <k1, v1> pairs and transforms them into <k2, v2> pairs.

[Same diagram, map phase highlighted]

Page 24: Scalable Preservation Workflows

MapReduce/Hadoop in a Nutshell

Shuffle/Sort takes <k2, v2> pairs and transforms them into <k2, list(v2)>.

[Same diagram, shuffle/sort phase highlighted]

Page 25: Scalable Preservation Workflows

MapReduce/Hadoop in a Nutshell

Reduce takes <k2, list(v2)> and transforms it into <k3, v3>. A toy command-line illustration of the whole pipeline follows below.

[Same diagram, reduce phase highlighted]
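The three phases can be mimicked locally with plain UNIX filters, which makes the key/value contract concrete (toy records, illustrative only):

  printf 'html\njpg\nhtml\nmidi\n' |  # input records
    awk '{print $1 "\t1"}' |          # map: record -> <k2, v2> = <mime, 1>
    sort |                            # shuffle/sort: groups pairs by key
    uniq -c                           # reduce: aggregates each list(v2) into a count per key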

Page 26: Scalable Preservation Workflows


Cluster Set-up

Page 27: Scalable Preservation Workflows

Platform Deployment

• There is no prescribed deployment model
  • Private, institutionally shared, or external data center
  • Possible to deploy on “bare metal” or using virtualization and cloud middleware
• Platform environment packaged as a VM image
  • Automated and scalable deployment
  • Presently supporting Eucalyptus (and AWS) clouds
• SCAPE provides two shared platform instances
  • A stable, non-virtualized data-center cluster
  • A private-cloud-based development cluster
  • Partitioning and dynamic reconfiguration

Page 28: Scalable Preservation Workflows

Deploying Environments

• IaaS enables packaging and dynamic deployment of (complex) software environments
  • But requires complex virtualization infrastructure
• Data-intensive technology is able to deal with a constantly varying number of cluster nodes
  • Node failures are expected and handled automatically
  • The system can grow/shrink on demand
• A Network Attached Storage solution can be used as a data source
  • But it does not meet the scalability and performance needs of the computation
• SCAPE Hadoop clusters
  • Linux + preservation tools + SCAPE Hadoop libraries
  • Optionally higher-level services (repository, workflow, …)

Page 29: Scalable Preservation Workflows

ONB Experimental Cluster

[Diagram: a name node/job tracker controlling the task trackers and data nodes]

• Task trackers / data nodes (workers) – CPU: 1 x 2.53 GHz quad-core CPU (8 HyperThreading cores), RAM: 16 GB, disk: 2 x 1 TB configured as RAID0 (performance), 2 TB effective
  • Of the 8 HT cores per worker: 5 for map tasks, 2 for reduce tasks, 1 for the operating system – across the five workers, 25 processing cores for map tasks and 10 for reduce tasks
• Job tracker / name node – CPU: 2 x 2.40 GHz quad-core CPUs (16 HyperThreading cores), RAM: 24 GB, disk: 3 x 1 TB configured as RAID5 (redundancy), 2 TB effective

Page 30: Scalable Preservation Workflows

SCAPE Shared Clusters

• AIT (development cluster)
  • 10 dual-core nodes, 4 six-core nodes, ~85 TB disk storage
  • Xen and Eucalyptus virtualization and cloud management
• IMF (central instance)
  • Low-consumption machines in a NoRack column
  • Dual-core AMD 64-bit processor, 8 GB RAM, 15 TB on 5 disks
  • Production data center facility

Page 31: Scalable Preservation Workflows


Using the Cluster

Page 32: Scalable Preservation Workflows

• Wrapping sequential tools
  • Using a wrapper script (Hadoop Streaming API)
  • PT’s generic Java wrapper allows one to use pre-defined patterns (based on the toolspec language)
  • Works well for processing a moderate number of files, e.g. applying migration tools or FITS
• Writing a custom MapReduce application
  • Much more powerful and usually performs better
  • Suitable for more complex problems and file formats, such as web archives
• Using a high-level language like Hive or Pig
  • Very useful for analysis of (semi-)structured data, e.g. characterization output

Page 33: Scalable Preservation Workflows

Available Tools

• Preservation tools and libraries are pre-packaged so they can be automatically deployed on cluster nodes
  • SCAPE Debian packages
  • Supporting the SCAPE Tool Specification Language
• MapReduce libraries for processing large container files
  • For example, METS and (W)ARC RecordReaders
• Application scripts
  • Based on Apache Hive, Pig, Mahout
• Software components to assemble complex data-parallel workflows
  • Taverna and Oozie workflows

Page 34: Scalable Preservation Workflows

Sequential Workflows

• In order to run a workflow (or activity) on the cluster, it has to be parallelized first!
• A number of different parallelization strategies exist
  • The approach is typically determined on a case-by-case basis
  • May lead to changes of activities, the workflow structure, or the entire application
  • Automated parallelization will only work to a certain degree
• Trivial workflows can be deployed/executed without individual parallelization (wrapper approach)
• A SCAPE driver program for parallelizing Taverna workflows
• SCAPE template workflows developed for different institutional scenarios

Page 35: Scalable Preservation Workflows

Parallel Workflows

• Are typically derived from sequential (conceptual) workflows created for the desktop environment (but may differ substantially!)
• Rely on MapReduce as the parallel programming model and Apache Hadoop as the execution environment
• Data decomposition is handled by the Hadoop framework based on input format handlers (e.g. text, WARC, METS-XML, etc.)
• Can make use of a workflow engine (like Taverna or Oozie) for orchestrating complex (composite) processes
• May include interactions with data management systems (repositories) and sequential (concurrently executed) tools
• Tool invocations are based on an API or command-line interface and performed as part of a MapReduce application

Page 36: Scalable Preservation Workflows


MapRed Tool Wrapper

Page 37: Scalable Preservation Workflows

Tool Specification Language

• The SCAPE Tool Specification Language (toolspec) provides a schema to formalize command-line tool invocations
• Can be used to automate a complex tool invocation (many arguments) based on a keyword (e.g. ps2pdfs)
• Provides a simple and flexible mechanism to define tool dependencies, for example of a workflow
  • These can be resolved by the execution system using Linux packages
• The toolspec is minimalistic and can be easily created for individual tools and scripts
• Tools provided as SCAPE Debian packages come with a toolspec document by default

Page 38: Scalable Preservation Workflows

Ghostscript Example

[Slide image not transcribed; an illustrative command follows below]
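As a hedged illustration, the kind of command line such a toolspec keyword (e.g. ps2pdfs) would abstract could look like this (standard Ghostscript options, file names hypothetical):

  # convert a PostScript file to PDF in one non-interactive invocation
  gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf input.ps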

Page 39: Scalable Preservation Workflows

MapRed Toolwrapper

• Hadoop provides scalability, reliability, and robustness, supporting the processing of data that does not fit on a single machine
  • The application must, however, be made compliant with the execution environment
• Our intention was to provide a wrapper that allows one to execute a command-line tool on the cluster in a similar way as on a desktop environment
  • The user simply specifies the toolspec file, command name, and payload data (an invocation sketch follows below)
  • Supports HDFS references and (optionally) standard IO streams
• Supports the SCAPE toolspec to execute preinstalled tools or other applications available via the OS command-line interface
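A sketch of such an invocation, following the command shown on page 48 (the job name and input file list are hypothetical):

  # run a toolspec command over all files listed in input-filelist.txt
  hadoop jar mpt-mapred.jar -j mime-check -i input-filelist.txt -r toolspecs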

Page 40: Scalable Preservation Workflows

Hadoop Streaming API

• The Hadoop streaming API supports the execution of scripts (e.g. bash or python), which are automatically translated into and executed as MapReduce applications
  • Can be used to process data with common UNIX filters using commands like echo, awk, tr (see the example below)
• Hadoop is designed to process its input based on key/value pairs, i.e. the input data is interpreted and split by the framework
  • Perfect for processing text, but difficult for binary data
• The streaming API uses streams to read/write from/to HDFS
  • Preservation tools typically do not support HDFS file pointers and/or IO streaming through stdin/stdout
  • Hence, DP tools are mostly not usable with the streaming API
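A minimal streaming sketch in the spirit of the UNIX-filter use case above (jar and HDFS paths hypothetical): tr acts as the mapper, splitting lines into words, and uniq -c acts as the reducer, counting each word after the framework has sorted the keys:

  hadoop jar hadoop-streaming.jar \
    -input /user/scape/text \
    -output /user/scape/wordcount \
    -mapper 'tr " " "\n"' \
    -reducer 'uniq -c'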

Page 41: Scalable Preservation Workflows

Suitable Use-Cases

• Use the MapRed toolwrapper when dealing with (a large number of) single files
  • Be aware that this may not be an ideal strategy; there are more efficient ways to deal with many files on Hadoop (sequence files, HBase, etc.)
  • However, it is practical and sufficient in many cases, as no additional application development is required
• A typical example is file format migration on a moderate number of files (e.g. 100,000s), which can be included in a workflow with additional QA components
• Very helpful when the payload is simply too big to be computed on a single machine

Page 42: Scalable Preservation Workflows

Example – Exploring an Uncompressed WARC

• Unpacked a 1 GB WARC.GZ on a local computer
  • 2.2 GB unpacked => 343,288 files
  • `ls` took ~40 s; counting *.html files with `file` took ~4 hrs => 60,000 HTML files
• Provided the corresponding bash command as a toolspec:

  <command>if [ "$(file ${input} | awk "{print \$2}" )" == HTML ]; then echo "HTML" ; fi</command>

• Moved the data to HDFS and executed pt-mapred with the toolspec:
  • 236 min on the local file system
  • 160 min with 1 mapper on HDFS (this was a surprise!)
  • 85 min (2 mappers), 52 min (4), 27 min (8)
  • 26 min with 8 mappers and IO streaming (also a surprise)

Page 43: Scalable Preservation Workflows

Ongoing Work

• Source project and README on GitHub, presently under openplanets/scape/pt-mapred*
  • Will be migrated to its own repository soon
• Presently it is required to generate an input file that specifies the input file paths (along with optional output file names)
  • TODO: read binaries directly based on an input directory path, allowing Hadoop to take advantage of data locality
• Input/output streaming and piping between toolspec commands has already been implemented
• TODO: add support for Hadoop sequence files
• Look into possible integration with the Hadoop Streaming API

* https://github.com/openplanets/scape/tree/master/pt-mapred

Page 44: Scalable Preservation Workflows


Example Workflows

Page 45: Scalable Preservation Workflows

What We Mean by Workflow

• Formalized (and repeatable) processes/experiments consisting of one or more activities, interpreted by a workflow engine
• Usually modeled as DAGs based on control-flow and/or data-flow logic
• The workflow engine functions as a coordinator/scheduler that triggers the execution of the involved activities
  • May be performed by a desktop or server-side component
• Example workflow engines are Taverna Workbench, Taverna Server, and Apache Oozie
  • Not equally rich, and designed for different purposes: experimentation & science, SOA, Hadoop integration

Page 46: Scalable Preservation Workflows

Taverna

• A workflow language and graphical editing environment based on a dataflow model
  • Linking activities (tools, web services) based on data pipes
• High-level workflow diagram abstracting low-level implementation details
  • Think of a workflow as a kind of configurable script
  • Easier to explain, share, reuse, and repurpose
• Taverna Workbench provides a desktop environment to run instances of that language
  • Workflows can also be run in headless and server mode
• It doesn’t necessarily run on a grid, cloud, or cluster, but can be used to interact with those resources

Page 47: Scalable Preservation Workflows

Image Migration #1

• Extract TIFF metadata with Matchbox and Jpylyzer
• Perform OpenJPEG TIFF-to-JP2 migration
• Extract JP2 metadata with Matchbox and Jpylyzer
• Validation based on Jpylyzer profiles
• Compare SIFT image features to test visual similarity
• Generate report

Page 48: Scalable Preservation Workflows

Image Migration #2

• No significant changes in workflow structure compared to the sequential workflow
• Orchestrating remote activities using Taverna’s Tool Plugin over SSH
• Using the Platform’s MapRed toolwrapper to invoke command-line tools on the cluster:

  hadoop jar mpt-mapred.jar -j $jobname -i $infile -r toolspecs

Page 49: Scalable Preservation Workflows

WARC Identification #1

[Diagram: a (W)ARC container (JPG, GIF, HTM, HTM, MID records) is read by a (W)ARC RecordReader based on the HERITRIX web crawler’s (W)ARC read/write code; a MapReduce job calls Apache Tika in the map phase to detect the MIME type of each record and aggregates the counts in the reduce phase, e.g. image/jpg 1, image/gif 1, text/html 2, audio/midi 1]

Tool integration pattern                              Throughput (GB/min)
TIKA detector API call in map phase                   6.17
FILE called as command-line tool from map/reduce      1.70
TIKA JAR command-line tool called from map/reduce     0.01

Amount of data    Number of ARC files    Throughput (GB/min)
1 GB              10 x 100 MB            1.57
2 GB              20 x 100 MB            2.50
10 GB             100 x 100 MB           3.06
20 GB             200 x 100 MB           3.40
100 GB            1000 x 100 MB          3.71

Page 50: Scalable Preservation Workflows

WARC Identification #2

[Chart comparing TIKA 1.0 and DROID 6.01]

Page 51: Scalable Preservation Workflows

Quality Assurance #1

• ETL processing of 60,000 books, ~24 million pages
• Using Taverna’s “Tool service” (remote SSH execution)
• Orchestration of different types of Hadoop jobs
  • Hadoop Streaming API
  • Hadoop MapReduce
  • Hive
• Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna

Page 52: Scalable Preservation Workflows

Quality Assurance #2

• Create input text files containing file paths (JP2 & HTML)
• Read image metadata using ExifTool (Hadoop Streaming API)
• Create a sequence file containing all HTML files
• Calculate the average block width using MapReduce
• Load data into Hive tables
• Execute SQL test query

Page 53: Scalable Preservation Workflows


Quality Assurance – Using Apache Oozie

Page 54: Scalable Preservation Workflows

Quality Assurance #3 – Using Apache Oozie

• Remote workflow scheduler for Hadoop
  • Accessible via a REST interface (a submission sketch follows below)
• Control-flow-oriented workflow language
• Well integrated with the Hadoop stack (MapRed, Pig, HDFS)
  • Hadoop API called directly, no more SSH interaction required
  • Deals with classpath problems and different library versions
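As a hedged sketch, such a workflow would be submitted to the scheduler with the standard Oozie client (server URL and properties file hypothetical):

  # submit and start a workflow defined by job.properties
  oozie job -oozie http://localhost:11000/oozie -config job.properties -run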

Page 55: Scalable Preservation Workflows


Conclusions & Resources

Page 56: Scalable Preservation Workflows

Conclusions

• When dealing with large amounts of data in terms of #files, #objects, #records, #TB of storage, traditional data management techniques begin to fail (file system operations, databases, tools, etc.)
• Scalability and robustness are key
• Data-intensive technologies can help a great deal, but do not support the desktop tools and workflows used in many domains out of the box
• SCAPE has ported a number of preservation scenarios identified by its user groups from sequential workflows to a scalable (Hadoop-based) environment
• The required effort can vary a lot depending on the infrastructure in place, the nature of the data, scale, complexity, and required performance

Page 57: Scalable Preservation Workflows

Resources

• Project website: www.scape-project.eu
• GitHub: https://github.com/openplanets/
• SCAPE group on myExperiment: http://www.myexperiment.org
• SCAPE tools: http://www.scape-project.eu/tools
• SCAPE on SlideShare: http://www.slideshare.net/SCAPEproject
• SCAPE application areas at the Austrian National Library: http://www.slideshare.net/SvenSchlarb/elag2013-schlarb
• Submission and execution of SCAPE workflows: http://www.scape-project.eu/deliverable/d5-2-job-submission-language-and-interface

Page 58: Scalable Preservation Workflows


Thank you! Questions?

Page 59: Scalable Preservation Workflows


Backup Slides

Page 60: Scalable Preservation Workflows

Reading Image Metadata

[Workflow diagram: Jp2PathCreator – a find run over the NAS produces a ~1.4 GB list of JP2 paths (/NAS/Z119585409/00000001.jp2, …); HadoopStreamingExiftoolRead – reads each image from the NAS with ExifTool and emits one metadata value (e.g. the image width) per page (Z119585409/00000001 2345, …), a ~1.2 GB result]

• 60,000 books (24 million pages): ~5 h + ~38 h = ~43 h

Page 61: Scalable Preservation Workflows

SequenceFile Creation

[Workflow diagram: HtmlPathCreator – a find run over the NAS produces a ~1.4 GB list of HTML paths (/NAS/Z119585409/00000707.html, …); SequenceFileCreator – reads the files from the NAS and packs them into a Hadoop SequenceFile keyed by page ID (Z119585409/00000707, …), ~997 GB uncompressed]

• 60,000 books (24 million pages): ~5 h + ~24 h = ~29 h

Page 62: Scalable Preservation Workflows

Calculate Average Block Width Using MapReduce

[Workflow diagram: HadoopAvBlockWidthMapReduce – the map phase emits one width value per block of a page (e.g. Z119585409/00000001 2100, 2200, 2300, 2400), the reduce phase aggregates them into one average block width per page (e.g. Z119585409/00000001 2250); input is the SequenceFile, output a text file]

• 60,000 books (24 million pages): ~6 h

Page 63: Scalable Preservation Workflows

Analytic Queries – HiveLoadExifData & HiveLoadHocrData

CREATE TABLE jp2width (jid STRING, jwidth INT)
CREATE TABLE htmlwidth (hid STRING, hwidth INT)

Table jp2width:
jid                   jwidth
Z119585409/00000001   2250
Z119585409/00000002   2150
Z119585409/00000003   2125
Z119585409/00000004   2125
Z119585409/00000005   2250

Table htmlwidth:
hid                   hwidth
Z119585409/00000001   1870
Z119585409/00000002   2100
Z119585409/00000003   2015
Z119585409/00000004   1350
Z119585409/00000005   1700

Page 64: Scalable Preservation Workflows

Analytic Queries – HiveSelect

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

jid                   jwidth   hwidth
Z119585409/00000001   2250     1870
Z119585409/00000002   2150     2100
Z119585409/00000003   2125     2015
Z119585409/00000004   2125     1350
Z119585409/00000005   2250     1700
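Building on this join, a hedged sketch of the kind of test query that flags suspicious pages (the width-difference threshold is hypothetical), run non-interactively via the Hive CLI:

  # report pages whose HTML block width deviates strongly from the image width
  hive -e 'select jid from jp2width inner join htmlwidth on jid = hid where jwidth - hwidth > 300'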
