SCAPE Information Day at BL - Large Scale Processing with Hadoop


DESCRIPTION

This presentation was given by Will Palmer at ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. In this presentation Will Palmer introduced Hadoop and the way the British Library and SCAPE have used Hadoop to process large-scale data.

TRANSCRIPT

Page 1: SCAPE Information Day at BL - Large Scale Processing with Hadoop

William Palmer

Some slides courtesy of Per Møldrup-Dalum (State and University Library, Denmark) and Sven Schlarb (Austrian National Library)

SCAPE Information Day

British Library, UK, 14th July 2014

Large Scale Processing with Hadoop

Page 2: SCAPE Information Day at BL - Large Scale Processing with Hadoop


Large Scale Processing Methodologies

This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

• Traditional

  • One central, large processing capability

  • One or more central storage instances

  • Data stored away from the processor

  • Paradigm: “Move the data to the processor”

• Hadoop

  • Many smaller commodity computers/CPUs

  • Storage capacity in all computers, federated together

  • Easily expandable

  • Paradigm: “Move the processor to the data” (the sketch below shows how Hadoop exposes where data lives)
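To make the second paradigm concrete: HDFS can report which cluster nodes hold each block of a file, and the MapReduce scheduler uses that information to run tasks on (or near) those nodes. Below is a minimal, illustrative Java sketch of that lookup (not from the slides); BlockLocator is a made-up name, and a standard Hadoop client classpath is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical utility: list which nodes hold each block of an HDFS file.
public class BlockLocator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]); // an HDFS path passed on the command line
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports its byte range and the hosts storing a replica.
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
    }
}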

Page 3: SCAPE Information Day at BL - Large Scale Processing with Hadoop

Example

• The New York Times + Hadoop on Amazon Web Services

  • 11 million articles (1851-1980) that needed to be converted to PDF

  • 4 TB of TIFF data

  • 24 hours of wall-clock time to complete the migration

  • Cost: $240 (not including bandwidth)

  • http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/

  • http://cse.unl.edu/~byrav/INFOCOM2011/workshops/papers/p1099-xiao.pdf

Page 4: SCAPE Information Day at BL - Large Scale Processing with Hadoop

Hadoop Ecosystem: The Zoo

[Diagram: the Hadoop ecosystem, with HDFS (providing data locality) and MapReduce at its core, surrounded by related projects.]

Page 5: SCAPE Information Day at BL - Large Scale Processing with Hadoop

MapReduce

[Diagram: the two phases of a MapReduce job, MAP followed by REDUCE.]

Page 6: SCAPE Information Day at BL - Large Scale Processing with Hadoop

MapReduce in detail

[Diagram: the input is divided into input splits, each a sequence of records; every split is processed by a Map task, the map outputs are sorted, shuffled, and merged, and the Reduce tasks then produce the reducer output.]
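The diagram corresponds to the two functions a developer actually writes. As a minimal sketch, assuming Hadoop’s standard org.apache.hadoop.mapreduce API, here is the conventional word-count illustration (not code from the slides): the map emits (word, 1) pairs per record, and after sort/shuffle/merge the reduce sums the counts for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: called once per record (here, one line of text per record).
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}

// Reduce phase: after sort/shuffle/merge, all counts for one word arrive
// together and are summed into a single output record.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}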

Page 7: SCAPE Information Day at BL - Large Scale Processing with Hadoop

Hadoop In Action

• Designed for processing text

• Capacity can be shrunk or expanded easily

• Comes with the HDFS filesystem, providing federation and redundancy (three copies of each block by default); a sketch after this list shows how replication can be inspected per file

• With commodity hardware, node failures are expected

• A single node being down should not affect the cluster

• Data locality is considered when distributing computation: data is processed where it is stored, reducing the need to transfer it

• Very large community and ecosystem
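For example, the replication factor mentioned above can be read and changed per file through Hadoop’s FileSystem API. A minimal sketch, assuming a standard Hadoop client classpath; ReplicationDemo and the replication value of 4 are illustrative, not from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical snippet: inspect and change one file's replication factor.
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication: " + status.getReplication());
        // Ask HDFS to keep an extra copy of this file (4 instead of the default 3).
        fs.setReplication(file, (short) 4);
    }
}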

Page 8: SCAPE Information Day at BL - Large Scale Processing with Hadoop

(Obligatory) Hadoop Screenshots

14/02/13 11:22:33 INFO gzchecker.GZChecker: Loading paths...
14/02/13 11:22:36 INFO gzchecker.GZChecker: Setting paths...
14/02/13 11:22:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
…
14/02/13 11:22:39 INFO mapred.FileInputFormat: Total input paths to process : 1
14/02/13 11:22:40 INFO mapred.JobClient: Running job: job_201401131502_0058
14/02/13 11:22:41 INFO mapred.JobClient: map 0% reduce 0%
…
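Log lines of this shape come from submitting a job with the older org.apache.hadoop.mapred API. The following is a hedged, minimal driver sketch that would produce similar output; DriverSketch is a made-up name and the mapper/reducer setters are left commented out because the real GZChecker classes are not shown in the slides (the old API defaults to identity map/reduce).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DriverSketch.class);
        conf.setJobName("gzchecker");
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // conf.setMapperClass(...);  // the job's real mapper would go here
        // conf.setReducerClass(...); // and its reducer, if any
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Blocks until the job finishes, printing the JobClient progress
        // lines ("map 0% reduce 0%", ...) seen in the screenshot above.
        JobClient.runJob(conf);
    }
}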

Page 9: SCAPE Information Day at BL - Large Scale Processing with Hadoop

Hadoop In Action

• We are using Hadoop/MapReduce for parallelisation

• A non-standard use case

• As a parallelisation method it has associated costs …

• … but you get a lot of well-supported features for free:

  • HDFS

  • Administration

  • Support

• Once a MapReduce program is developed, scalability just happens

• You can theoretically prototype on a Raspberry Pi and run on a 3,000-node super cluster

Page 10: SCAPE Information Day at BL - Large Scale Processing with Hadoop

Hadoop In Action

• Do I have to copy data to HDFS for processing?

  • 1 TB of data took 8 hours to copy from the NAS to HDFS

  • The image format migration (TIFF to JP2) took ~57 hours

  • … and the data still has to be copied back to the NAS

• What if I don’t?

  • The same image-migration code, reading from and writing to the repository directly, took ~58 hours

  • No copying of data before/after

  • More efficient here because the processing time per file is large

  • This won’t necessarily hold for other preservation actions (see the “small files problem”; a common mitigation is sketched below)
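Where very many small files are involved, one common mitigation (a general Hadoop technique, not something SCAPE-specific) is to pack them into a single SequenceFile of (filename, bytes) records, so each map task reads many files from one large split. A minimal sketch; SmallFilePacker and the argument layout are made up for illustration.

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack a local directory of small files into one SequenceFile on HDFS,
// keyed by filename, so MapReduce sees one large file instead of many tiny ones.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
        try {
            // Assumes args[0] is an existing local directory of small files.
            for (File f : new File(args[0]).listFiles()) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}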

Page 11: SCAPE Information Day at BL - Large Scale Processing with Hadoop

Hadoop at The British Library

• Two Hadoop clusters:

  • Digital Preservation Team cluster

    • Virtualised hardware

    • 1 management node, 1 master node

    • 28 worker nodes (1 core/1 CPU, 6 GB RAM each)

    • 14 TB raw storage, 5 TB usable @ replication factor 3

    • Cloudera Hadoop (CDH4)

    • For testing/R&D

  • Web Archiving Team cluster

    • Physical hardware

    • 80 nodes (8 cores/2 CPUs, 16 GB RAM each)

    • 700 TB raw storage, 233 TB usable @ replication factor 3

    • Cloudera Hadoop (CDH3)

    • In production use

Page 12: SCAPE Information Day at BL - Large Scale Processing with Hadoop

SCAPE Workflow Results

• TIFF->JP2 migration with QA

  • Single node @ 26 files/hour (with OpenJPEG)

  • 28 nodes @ 735 files/hour (with OpenJPEG), roughly a 28x speed-up over a single node, i.e. near-linear scaling

  • 2,409 files/hour with Kakadu

• Detecting DRM in PDF files

  • 28 nodes @ 51,869 files/hour

• Identifying web content

  • 5.3 million files/hour

Page 13: SCAPE Information Day at BL - Large Scale Processing with Hadoop

Other Large Scale Execution Platforms

• SCAPE tools are treated as individual components and should be reusable on other large-scale execution platforms (at least, all the tools described today are)

• The British Library Digital Library System (DLS) has a bespoke workflow-execution system into which SCAPE tools have been integrated

• Other platforms: GNU Parallel …

• The tools can be integrated with your own systems