liber satellite event, scape by sven schlarb

Sven Schlarb Österreichische Nationalbibliothek LIBER Satellite Event: APARSEN & SCAPE Workshop 21 May 2014, Austrian National Library, Vienna Application scenarios of the SCAPE project at the Austrian National Library

Upload: scape-project

Post on 05-Dec-2014


DESCRIPTION

Sven Schlarb from the Austrian National Library gave an overview of the application scenarios at the Austrian National Library related to web archiving and the Austrian Books Online project. The presentation was given at the LIBER Satellite Event on long-term accessibility of digital resources in theory and practice (https://liber2014.univie.ac.at/satellite-event/) in Vienna on 21 May 2014.

TRANSCRIPT

Page 1: LIBER Satellite Event, SCAPE by Sven Schlarb

Sven Schlarb Österreichische Nationalbibliothek

LIBER Satellite Event: APARSEN & SCAPE Workshop 21 May 2014, Austrian National Library, Vienna

Application scenarios of the SCAPE project at the Austrian National Library

Page 2: LIBER Satellite Event, SCAPE by Sven Schlarb

Overview

• Examples of Big Data in memory institutions
• What are the SCAPE Testbeds?
• Motivation for the Austrian National Library
• Hadoop in a nutshell
• SCAPE Platform setup at the Austrian National Library
• Selected SCAPE tools
• Application scenarios
  • Web Archiving
  • Austrian Books Online

This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

Page 3: LIBER Satellite Event, SCAPE by Sven Schlarb

Books, Journals, Newspapers, Websites. Big data?

• Google Books Project: 30 million digital books
  • http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched
• Europeana: Metadata about over 24 million objects
  • Europeana annual report and accounts 2012, Europeana Foundation, April 2013
• Hathi Trust: 10 million volumes (over 5.6 million titles) comprising over 3.7 billion book page images
  • http://www.hathitrust.org/statistics_info
• Internet Archive: 364 billion pages, about 10 petabytes
  • http://archive.org and http://archive.org/web/petabox.php

Page 4: LIBER Satellite Event, SCAPE by Sven Schlarb

SCAPE Project Overview

Takeup
• Stakeholders and Communities
• Dissemination
• Training Activities
• Sustainability

Platform
• Automation
• Workflows
• Parallelization
• Virtualization

Page 5: LIBER Satellite Event, SCAPE by Sven Schlarb

Pushing the boundaries of RDBMS (e.g. MySQL)

• Good:
  • Storing structured data
  • Expressive query language
  • ACID, type safety
• But:
  • SQL joins not efficient at scale
  • ÖNB 2011: Failed to create a complete web archive index using a single-instance MySQL (write performance!)
• Solution?
  • Scaling vertically: bigger servers, hardware costs!
  • Scaling horizontally: sharding, maintenance costs!

Page 6: LIBER Satellite Event, SCAPE by Sven Schlarb

Comparison of storage costs

• Hadoop offers a cost advantage because
  • It usually runs on relatively inexpensive (commodity) hardware
  • No binding to specific vendors
  • Open-source software

Source: BITKOM Leitfaden Big-Data-Technologien-Wissen für Entscheider 2014, p. 39

Page 7: LIBER Satellite Event, SCAPE by Sven Schlarb

Dealing with large amounts of data

• Required to move data
  • From NAS to server
  • To cloud
• Multi-terabyte scenarios?

• Immediate processing
• Unified storage and processing capabilities
• Distributed I/O

Page 8: LIBER Satellite Event, SCAPE by Sven Schlarb

Some basic Hadoop assumptions

• When dealing with large data sets it is usually easier to bring the processor to the data than the data to the processor
• Fine-granular parallelisation: All processing cores of the cluster are used as processors
• Designed for failure: In large clusters hardware failure is the norm rather than the exception
• Redundancy: Redundant storage of data blocks (default: 3 copies)
• Data locality: Free nodes with direct access to data do the processing

Page 9: LIBER Satellite Event, SCAPE by Sven Schlarb

What is Hadoop (physically)?

Hadoop = MapReduce + HDFS

• Distributed processing (MapReduce)
  • 2 x Quad-Core-CPUs: 10 Map (parallelisation), 4 Reduce (aggregation)
• Distributed storage (HDFS)
  • 4 x 1 TB hard disks with redundancy 3: 1.33 TB effective
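The effective storage figure on this slide follows from dividing the raw disk capacity by the replication factor. The following is a minimal sketch of that arithmetic only; the function name is hypothetical, and real clusters also reserve space for the operating system and temporary data, which is ignored here.

```python
def effective_capacity_tb(disks: int, disk_size_tb: float, replication: int = 3) -> float:
    """Usable capacity when every data block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return disks * disk_size_tb / replication


# Figures from the slide: 4 disks of 1 TB each, 3 copies of every block.
node_capacity = effective_capacity_tb(disks=4, disk_size_tb=1.0, replication=3)
print(f"{node_capacity:.2f} TB effective")  # prints "1.33 TB effective"
```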

Page 10: LIBER Satellite Event, SCAPE by Sven Schlarb

Configuration per CPU

Configuration of one Quad-Core-CPU (= 1 node)
4 physical cores, 8 hyperthreading cores (the system "sees" 8 cores)

[Diagram: allocation of the 8 cores of one node: 1 x OS, 5 x Map, 2 x Reduce]

Page 11: LIBER Satellite Event, SCAPE by Sven Schlarb

Experimental Cluster

[Diagram labels: Job Tracker, Name Node, Task Trackers, Data Nodes]

CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores)
RAM: 16GB
DISK: 2 x 1TB DISKs configured as RAID0 (performance) – 2 TB effective
• Of 8 HT cores: 5 for Map; 2 for Reduce; 1 for the operating system
• 25 processing cores for Map tasks
• 10 processing cores for Reduce tasks

CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
DISK: 3 x 1TB DISKs configured as RAID5 (redundancy) – 2 TB effective
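The slot figures can be cross-checked with a little arithmetic. The sketch below takes the per-node allocation (5 Map, 2 Reduce, 1 operating system) and the cluster totals (25 Map, 10 Reduce) from the slide; the number of worker nodes is not stated there and is only inferred here, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSlots:
    """Core allocation of one worker node (figures taken from the slide)."""
    os_cores: int
    map_slots: int
    reduce_slots: int

    @property
    def total(self) -> int:
        return self.os_cores + self.map_slots + self.reduce_slots


worker = NodeSlots(os_cores=1, map_slots=5, reduce_slots=2)
cluster_map_slots = 25     # from the slide
cluster_reduce_slots = 10  # from the slide

# Inference, not a figure from the slide: both totals imply the same node count.
workers_from_map = cluster_map_slots // worker.map_slots
workers_from_reduce = cluster_reduce_slots // worker.reduce_slots

print(f"cores used per node: {worker.total}")
print(f"implied worker nodes: {workers_from_map}")
```

Both totals point to the same number of worker nodes, and the per-node allocation adds up to the 8 hyperthreading cores of a single quad-core CPU.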

Page 12: LIBER Satellite Event, SCAPE by Sven Schlarb

What is Hadoop (conceptually)?

[Diagram: the input data is divided into three input splits (split 1: records 1 to 3, split 2: records 4 to 6, split 3: records 7 to 9). Each split is processed by one Map task (Task 1 to Task 3). The Map output passes through Sort, Shuffle and Merge before the Reduce phase writes the aggregated results as output data.]
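The conceptual flow of this slide (input splits, Map tasks, sort and shuffle, Reduce, aggregated result) can be imitated in a few lines of plain Python. This is a single-process sketch for illustration only, not Hadoop code; the function names and the sample records are made up.

```python
from collections import defaultdict


def map_task(records):
    """Map: emit a (key, 1) pair for every word of every record in one split."""
    for record in records:
        for word in record.split():
            yield word, 1


def shuffle_and_sort(mapped_pairs):
    """Shuffle/sort: group all emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


# Three input splits with three records in total per the slide's layout.
input_splits = [
    ["web archive", "book"],
    ["book page", "web"],
    ["page page", "archive"],
]

# One map task per split; in Hadoop these would run in parallel on different nodes.
mapped = [pair for split in input_splits for pair in map_task(split)]
result = dict(reduce_task(key, values) for key, values in shuffle_and_sort(mapped))
print(result)  # {'archive': 2, 'book': 2, 'page': 3, 'web': 2}
```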

Page 13: LIBER Satellite Event, SCAPE by Sven Schlarb

Platform instance architecture at the Austrian National Library

• Access via REST API
• Workflow engine for complex jobs
• Hive as the frontend for analytic queries
• MapReduce/Pig for Extract, Transform, Load (ETL)
• "Small" objects in HDFS or HBase
• "Large" digital objects stored on NetApp Filer

[Diagram labels: Taverna Workflow engine, REST API]

Page 14: LIBER Satellite Event, SCAPE by Sven Schlarb

Selected SCAPE tools in various application scenarios

• Scalable command line processing: ToMaR
• Large-scale content profiling: C3PO
• JPEG2000 file format validation: Jpylyzer
• Duplicate image detection: Matchbox

Page 15: LIBER Satellite Event, SCAPE by Sven Schlarb

Overview of application scenarios

• Web Archiving
  • Web archive MIME type identification
  • Characterisation of web archive data
• Austrian Books Online
  • Scenario 2: Image File Format Migration
  • Scenario 3: Comparison of Book Derivatives
  • Scenario 4: MapReduce in Digitised Book Quality Assurance

Page 16: LIBER Satellite Event, SCAPE by Sven Schlarb

Webarchiving

• Storage: ca. 45 TB
• ca. 1.7 billion objects
• Domain harvesting
  • Entire top-level domain .at every 2 years
• Selective harvesting
  • Important websites that change regularly
• Event harvesting
  • Special occasions and events (e.g. elections)

Page 17: LIBER Satellite Event, SCAPE by Sven Schlarb

File format identification in web archives

Page 18: LIBER Satellite Event, SCAPE by Sven Schlarb

File format identification in web archives

[Diagram: a (W)ARC container holding JPG, GIF, HTM, HTM and MID records is read by a (W)ARC InputFormat and (W)ARC RecordReader (based on the HERITRIX web crawler). In the Map phase, Apache Tika detects the MIME type of each record (e.g. JPG gives image/jpg). The Reduce phase counts the records per MIME type: image/jpg 1, image/gif 1, text/html 2, audio/midi 1.]

Software integration                                    Throughput (GB/min)
TIKA detector API in Map phase                          6.17
FILE as command line application with MapReduce         1.70
TIKA JAR as command line application with MapReduce     0.01

Data volume   Number of ARC files   Throughput (GB/min)
1 GB          10 x 100 MB           1.57
2 GB          20 x 100 MB           2.5
10 GB         100 x 100 MB          3.06
20 GB         200 x 100 MB          3.40
100 GB        1000 x 100 MB         3.71
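The identification job pairs a detection step per archive record (Map) with a count per MIME type (Reduce). The sketch below imitates that pattern in plain Python. The tiny signature table is a hand-written stand-in for Apache Tika, not its API, and the record payloads are invented. It reports the registered type `image/jpeg` where the slide writes `image/jpg`.

```python
from collections import Counter

# Stand-in detector (assumption): a few well-known file signatures.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"MThd", "audio/midi"),
]


def detect_mime(payload: bytes) -> str:
    """Guess a MIME type from the first bytes of a record payload."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    head = payload[:256].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"


def map_record(payload: bytes):
    """Map: emit (MIME type, 1) for one archive record."""
    return detect_mime(payload), 1


def reduce_counts(pairs):
    """Reduce: sum the emitted counts per MIME type."""
    counts = Counter()
    for mime, n in pairs:
        counts[mime] += n
    return dict(counts)


# Invented payloads mirroring the slide's container: JPG, GIF, HTM, HTM, MID.
records = [
    b"\xff\xd8\xff\xe0\x00\x10JFIF",
    b"GIF89a\x01\x00\x01\x00",
    b"<html><body>a</body></html>",
    b"<!DOCTYPE html><html></html>",
    b"MThd\x00\x00\x00\x06",
]

mime_counts = reduce_counts(map_record(r) for r in records)
print(mime_counts)
```

Calling the detector inside the Map step, rather than starting an external process per record, corresponds to the fastest of the three integration variants in the throughput table.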

Page 19: LIBER Satellite Event, SCAPE by Sven Schlarb

Characterisation of web archive data

Page 20: LIBER Satellite Event, SCAPE by Sven Schlarb

Characterisation of web archive data

Page 21: LIBER Satellite Event, SCAPE by Sven Schlarb

Characterisation of web archive data

Page 22: LIBER Satellite Event, SCAPE by Sven Schlarb

Austrian Books Online

• Public private partnership with Google
• Only public domain
• Objective to scan ~ 600,000 volumes
  • ~ 200 million pages
• ~ 70 project team members
  • 20+ in core team
• ~ 200,000 physical volumes scanned so far
  • ~ 60 million pages

Page 23: LIBER Satellite Event, SCAPE by Sven Schlarb

ADOCO (Austrian Books Online Download & Control)

https://confluence.ucop.edu/display/Curation/PairTree

Google Public Private Partnership

ADOCO

Page 24: LIBER Satellite Event, SCAPE by Sven Schlarb

Quality assured image file format migration

• TIFF to JPEG2000 migration
  • Objective: Reduce storage costs by reducing the size of the images
• JPEG2000 to TIFF migration
  • Objective: Mitigation of the JPEG2000 file format obsolescence risk
• Different preservation tool categories:
  • Validation
  • Migration
  • Quality assurance

Page 25: LIBER Satellite Event, SCAPE by Sven Schlarb

Comparison of book derivatives

• Compare different versions of the same book
• Images come from different scanning sources
• Images have been manipulated (cropped, rotated)

Page 26: LIBER Satellite Event, SCAPE by Sven Schlarb

Using MapReduce for Quality Assurance

• 60,000 books, ~ 24 million pages
• Using Taverna's "Tool service" (remote ssh execution)
• Orchestration of different types of Hadoop jobs
  • Hadoop Streaming API
  • Hadoop Map/Reduce
  • Hive
• Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna

Page 27: LIBER Satellite Event, SCAPE by Sven Schlarb

Using MapReduce for Quality Assurance

[Figure: two page images, labelled "Cropping error" and "Correct cropping", annotated with the image width and the block width]

Assumption: A "significant" difference between the average block width and the image width is an indicator of possible text loss due to a cropping error.

Page 28: LIBER Satellite Event, SCAPE by Sven Schlarb

Using MapReduce for Quality Assurance

• Create input text files with file paths (JP2 & HTML)
• Read image metadata using Exiftool (Hadoop Streaming API)
• Create sequence file containing all HTML files
• Calculate average block width using MapReduce
• Load data into Hive tables
• Execute SQL test query
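The last three steps (average block width per page, joining it with the image width, running a test query) can be sketched in plain Python. This is an illustration under stated assumptions, not the project's MapReduce or Hive code: the function names, the 0.5 threshold and the sample values are all made up, and "significant difference" is interpreted here as a relative difference above that threshold.

```python
from collections import defaultdict


def average_block_widths(block_records):
    """Reduce step: (page_id, block_width) pairs -> {page_id: mean block width}."""
    sums = defaultdict(lambda: [0, 0])  # page_id -> [sum of widths, block count]
    for page_id, width in block_records:
        sums[page_id][0] += width
        sums[page_id][1] += 1
    return {page: total / count for page, (total, count) in sums.items()}


def suspicious_pages(avg_block_width, image_width, threshold=0.5):
    """Join both tables on page id and flag pages with a large relative difference."""
    flagged = []
    for page_id, avg in avg_block_width.items():
        width = image_width.get(page_id)
        if not width:
            continue  # no image metadata for this page
        if abs(width - avg) / width > threshold:
            flagged.append(page_id)
    return sorted(flagged)


# Invented sample data: block widths from the HTML files, image widths from Exiftool.
blocks = [("page-1", 1800), ("page-1", 1700), ("page-2", 600), ("page-2", 500)]
images = {"page-1": 2000, "page-2": 2000}

averages = average_block_widths(blocks)
flagged = suspicious_pages(averages, images)
print(averages)  # {'page-1': 1750.0, 'page-2': 550.0}
print(flagged)   # ['page-2']
```

Pages flagged this way are only candidates for manual inspection; a suitable threshold would have to be calibrated on real data.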

Page 29: LIBER Satellite Event, SCAPE by Sven Schlarb

• Possibility for libraries to build cost-efficient solutions for storing large data collections
  • HDFS as storage master or staging area?
  • Local cluster vs. cloud?
• Apache Hadoop offers a stable core for building a large scale processing platform; ready to be used in production
• Carefully select additional components from the Apache Hadoop Ecosystem (HBase, Hive, Pig, Oozie, Yarn, Ambari, etc.) that fit your needs

Summary

Page 30: LIBER Satellite Event, SCAPE by Sven Schlarb

These slides on Slideshare:
• http://de.slideshare.net/SvenSchlarb/application-scenarios-of-the-scape-project-at-the-austrian-national-library

Further information
• Project website: www.scape-project.eu
• Github repository: www.github.com/openplanets
• Project Wiki: www.wiki.opf-labs.org/display/SP/Home

SCAPE tools mentioned
• ToMaR: http://openplanets.github.io/ToMaR/#
• Jpylyzer: http://www.openplanetsfoundation.org/software/jpylyzer
• Matchbox: https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
• C3PO: http://ifs.tuwien.ac.at/imp/c3po

Thank you! Questions?
