
SCAPE

Andy Jackson, The British Library

SCAPEdev1, AIT, Vienna, 6th–7th June 2011

PC Integration Plan
First SCAPE Developers’ Workshop

The Challenges

• Reproducible tool invocation across all contexts:
  • CLI, Java, SOAP to REST

• Interoperable data formats and consistent semantics across contexts, where required:
  • So clients can call tools and understand their outputs correctly

• Ease of development and deployment

Preservation Components (Tools)

• Characterization Components
  • Tell me about this digital object…
  • Does it contain known preservation risks?
  • Is it valid according to the format specification, or to my profile?

• Preservation Action Components
  • Transform this digital object into this format…
  • Repair links, or remove preservation risks

• Quality Assurance Components
  • Assess the differences between these two objects…
  • Assess the Preservation Actions

Tool Integration Roadmap

• For year one:
  • Focus on getting the Testbeds going
  • Deploying ad-hoc WSDL/SOAP web services
  • While learning what we need

• For year two:
  • Start using Tool Specifications
    • Define how to run tools (CLI or pure Java)
    • Invoked locally, or as RESTful services

• For year three:
  • We’ll see, based on year two.

Year One Plan

• Taverna for the Testbeds
  • Loose integration via WSDL/SOAP
  • XML inputs and outputs can be managed easily
  • Workflows work without installing anything else

• ONB is hosting services at the moment
  • Deployed via Sven’s Axis2 wrapper (on GitHub)
  • Not so pretty, but it works
  • Exports parameters as ports so Taverna can show them

• Planning ahead…
  • Working with Taverna External Tools & Components
  • Building shareable tool specifications

Tool Specifications

• Simple XML definition specifies the tool and how to invoke it to perform particular actions
  • Based on the Taverna External Tools plugin:
  • http://www.mygrid.org.uk/dev/wiki/display/developer/Calling+external+commands+from+Taverna

<program name="bourne_shell_script" description="execute shell script"
         command="/bin/sh input">
    <input name="shell_script">
        <file path="input" />
    </input>
</program>

Re-using Tool Specifications

• Makes sharing tool specs easy
  • Email a tool spec to a colleague, share it on GitHub, etc.

• Allows more reproducible tool invocation (see the sketch below):
  • Invoked from Taverna, the CLI or REST via shared ‘launcher’ code (we only write the wrapper once)
  • The invoker can add performance metrics automatically
  • The invoker could add optional deep process analysis
  • Process results can be shared
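
A minimal sketch of the shared ‘launcher’ idea, assuming a tool spec that carries a command template with an %{input} placeholder. The class name, placeholder syntax and metrics handling are illustrative assumptions, not the actual Taverna or SCAPE wrapper code.

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical shared launcher: runs a tool-spec command template
// against a local file and records a basic performance metric.
public class ToolLauncher {

    // e.g. "/bin/sh %{input}" -- the placeholder syntax is an assumption
    private final String commandTemplate;

    public ToolLauncher(String commandTemplate) {
        this.commandTemplate = commandTemplate;
    }

    // Invoke the tool on a local file, returning its exit code.
    public int invoke(File input) throws IOException, InterruptedException {
        String command = commandTemplate.replace("%{input}", input.getAbsolutePath());
        // Naive split; a real launcher would tokenize the spec properly:
        List<String> args = Arrays.asList(command.split(" "));

        long start = System.currentTimeMillis();
        Process process = new ProcessBuilder(args).start();
        int exitCode = process.waitFor();
        long elapsed = System.currentTimeMillis() - start;

        // This is where performance metrics can be added automatically:
        System.out.println("Tool ran in " + elapsed + " ms, exit code " + exitCode);
        return exitCode;
    }
}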

Interoperability: Tools & Components

• Taverna wants to standardize Components
  • Wants to hot-swap different implementations of the same action

• Planets had standardized actions
  • WSDL-based interface definitions
  • Too high-level; local first, please!
  • Too complex and non-extensible; tool wrapping was hard

• Let’s bring the two together…

CLI and Java Interfaces

• Extensible Java method signatures & CLI templates
  • e.g. Identify must accept at least a digital object, and return at least a URI (see the sketch below)
  • Extra parameters may exist, but must have sane defaults

• More flexible than in Planets
  • But tight enough that clients can call easily
  • Coded for local data and/or streams

• More constrained than ‘vanilla’ Taverna use
  • Should align with Taverna Component efforts
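
Read as Java, the constraint might look like the following minimal interface. The names Identify and IdentifyResult are illustrative assumptions, not agreed SCAPE signatures.

import java.io.File;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical minimal signature: Identify accepts at least a digital
// object and returns at least a URI; extra parameters are optional.
public interface Identify {

    // Identify a local file using default parameter values.
    IdentifyResult identify(File digitalObject);

    // Extra parameters may exist, but must have sane defaults.
    IdentifyResult identify(File digitalObject, Map<String, String> parameters);

    // The result must carry at least one (format) URI.
    interface IdentifyResult {
        List<URI> getFormatUris();
    }
}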

Standard, Extensible Interfaces

• Standard processes may include:
  • Identify, Characterize, Validate
  • Migrate/Transform/Convert
  • Compare, Assess
  • We should document the logic on the SCAPE wiki

• Deployment helpers wrap this up so it can be deployed in different contexts (see the sketch below):
  • A CLI invoker for local development and testing
  • A JAX-RS RESTful service mapping
  • Also wrap benchmarking code around the invocation
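
A sketch of how a deployment helper might map the Identify interface above onto a JAX-RS RESTful service. The path, parameter name and XML binding are assumptions.

import java.io.File;
import java.net.URI;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

// Hypothetical JAX-RS mapping over the Identify interface sketched above.
@Path("/identify")
public class IdentifyResource {

    private final Identify tool;

    public IdentifyResource(Identify tool) {
        this.tool = tool;
    }

    // e.g. GET /identify?src=file:///data/sample.tiff
    @GET
    @Produces(MediaType.APPLICATION_XML)
    public Identify.IdentifyResult identify(@QueryParam("src") URI src) {
        // A real helper would resolve src to a local copy first (see the
        // data-handling slide) and use a JAXB binding to render the
        // result as XML; here a local file: URI is assumed.
        return tool.identify(new File(src));
    }
}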

Interoperability: Data Handling

• Planets defaulted to pass-by-value
  • Cumbersome and brittle

• SCAPE will default to pass-by-reference (URI)
  • Leverage URI schemes to delegate issues like encoding and authentication to the transport layer
  • A more modular design, leveraging standard transports

• Java/CLI will expect local files or streams
  • A wrapper layer handles retrieving items via URI (see the sketch below)
  • Separation of concerns – the wrapper could support e.g. HTTP(S), SMB/CIFS, FTP, SFTP/SCP, HBase URIs, etc.
  • May modify or re-use the JHOVE2 Source/Input objects
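
A sketch of that wrapper layer, under a couple of assumptions: only file: and URL-supported schemes are shown, and the class and method names are invented for illustration.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

// Hypothetical wrapper: resolves a pass-by-reference URI to a local
// file so the Java/CLI tool only ever sees local data.
public class UriResolver {

    public File fetchToLocal(URI src) throws IOException {
        if ("file".equals(src.getScheme())) {
            return new File(src); // already local, no copy needed
        }
        // Delegate other schemes (http, https, ftp) to the URL machinery;
        // SMB/CIFS, SFTP/SCP or HBase support would plug in here.
        File local = File.createTempFile("scape-", ".tmp");
        InputStream in = src.toURL().openStream();
        OutputStream out = new FileOutputStream(local);
        try {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } finally {
            in.close();
            out.close();
        }
        return local;
    }
}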

Interoperability: Data Formats

• The required arguments passed to tools will be standardized via Java/JAXB (see the sketch below)
  • e.g. the JHOVE2 property tree as a Characterization result, mapped to and from XML

• Some other concepts will also need standardization:
  • Service descriptions for discovery (WADL?)
  • A declaration for the optional arguments
  • Format identifiers for supported input/output formats

• Passed through the TCC for review and dissemination
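
As an illustration of the Java/JAXB route, a property-tree result might be bound like this. The element and attribute names are assumptions in the spirit of JHOVE2 properties, not an agreed schema.

import java.util.ArrayList;
import java.util.List;

import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical JAXB binding for a JHOVE2-style property tree,
// mapped to and from XML by the standard JAXB marshaller.
@XmlRootElement(name = "characterization")
public class CharacterizationResult {

    @XmlElement(name = "property")
    public List<Property> properties = new ArrayList<Property>();

    public static class Property {
        @XmlAttribute
        public String uri;   // properties are identified by URIs

        @XmlAttribute
        public String value;

        @XmlElement(name = "property")
        public List<Property> children = new ArrayList<Property>(); // nesting
    }
}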

Interoperability: Sharing Concepts

• Common concepts shared on the SCAPE wiki
  • Tool interface definitions
  • Data definitions
  • Both linked to the source code, headed for the JavaDoc

• A SCAPE/OPF Registry
  • First understand what we really need for tool discovery and use, based on the initial integration plan
  • Then mix in wider integration issues
    • Define only format identifiers, or do more?
    • Track and merge with the UDFR effort? Now or later?

CC Development Plan

• Develop FITS, DROID, file, etc.
  • For identification (including conflict resolution via FITS) and brief characterization
  • Do not support compound objects well

• Develop JHOVE2 modules
  • For deep characterization, profile analysis, etc.
  • Supports compound objects
  • FITS as a JHOVE2 identification module?

CC Integrated Deployment

• CLI
  • FITS and JHOVE2 have CLI interfaces; wrap them as Tool Specs

• REST API (see the sketch below)
  • Source URI in, properties out
  • Property data to follow the JHOVE2 form
    • e.g. normalize output using the JHOVE2 property language
    • Properties have URIs
    • An RDF approach is compatible
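
From the client side, the “source URI in, properties out” call could be as simple as the following sketch. The endpoint path and the src parameter name are assumptions.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical client for a characterization REST endpoint.
public class CharacterizeClient {

    public static void main(String[] args) throws IOException {
        String src = URLEncoder.encode("http://example.org/sample.tiff", "UTF-8");
        URL endpoint = new URL("http://localhost:8080/characterize?src=" + src);

        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestProperty("Accept", "application/xml");

        // Print the JHOVE2-style property tree returned as XML:
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}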

CC Validation Interface

• Format/profile validation
  • Re-use the JHOVE2 assessment language for profile validation, if appropriate

• RESTful version
  • If we need validation over REST, consider re-using the W3C Unicorn Validator interface:
  • http://code.w3.org/unicorn/wiki/Documentation/Observer

PA Integration Plan

• Develop as standalone tools
  • Improving existing tools or making new ones

• Initially Web Services
  • As Sven has been doing

• CLI
  • Wrap the standalone tool in a Tool Spec, which specifies input and output formats, etc.

• REST (see the sketch below)
  • Use a src parameter to pass the input and create a new resource
  • Return the result alone, or with a report, via Content Negotiation
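
A sketch of that REST shape: the /migrate path, the 201 Created response and the result URI are assumptions about how “create a new resource” might be realized with JAX-RS.

import java.net.URI;

import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.Response;

// Hypothetical Preservation Action endpoint: src URI in, new resource out.
@Path("/migrate")
public class MigrateResource {

    @POST
    @Produces({"application/pdf", "application/xml"})
    public Response migrate(@QueryParam("src") URI src) {
        // Resolve src, run the migration tool (both elided), then answer
        // 201 Created pointing at the new resource; Content Negotiation
        // decides whether the body is the result alone or a report.
        URI created = URI.create("/migrate/results/1"); // illustrative
        return Response.created(created).build();
    }
}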

QA Integration Plan

• Develop standalone tools
  • Improving existing tools or making new ones
  • Re-use the JHOVE2 property language for comparative properties
  • Re-use the JHOVE2 assessment language for profile validation?

• RESTful Compare interface (see the sketch below)
  • Two URIs in: src1 & src2
  • Properties out, re-using the JHOVE2 model
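
The Compare interface might map onto JAX-RS like this, re-using the CharacterizationResult sketch from the data-formats slide. Paths and names remain assumptions.

import java.net.URI;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

// Hypothetical Compare endpoint: two source URIs in, properties out.
@Path("/compare")
public class CompareResource {

    // e.g. GET /compare?src1=...&src2=...
    @GET
    @Produces(MediaType.APPLICATION_XML)
    public CharacterizationResult compare(@QueryParam("src1") URI src1,
                                          @QueryParam("src2") URI src2) {
        // Fetch both objects and diff their properties (elided), returning
        // the comparative properties in the JHOVE2-style tree form.
        return new CharacterizationResult(); // illustrative stub
    }
}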

Repository Integration
Some initial ideas

SCAPE Platform Repository Integration

• Given an existing repository of content, how do we process items on Hadoop?

• Two examples from the New York Times.

• Three initial proposals.

Hadoop & The New York Times

• 4 TB of TIFFs+OCR converted to 1.5 TB of PDFs
  • 11 million articles in 24 hours on 100 EC2 nodes
  • They found a problem, but EC2 is cheap enough that they could afford to run it twice

• Tools
  • JetS3t – open-source Java toolkit for S3
  • iText PDF library
  • Java Advanced Imaging (JAI)

• http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/

NYT Project 2

“Using Amazon Web Services, Hadoop and our own code, we ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 xml files mapping articles to rectangular regions in the TIFF’s. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files — all of it ready to be assembled into a TimesMachine. By leveraging the power of AWS and Hadoop, we were able to utilize hundreds of machines concurrently and process all the data in less than 36 hours.”

http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
http://open.blogs.nytimes.com/tag/hadoop/

PROPOSAL 1: Repository Caching Cluster

• Workflow driven from Hadoop (see the sketch below):
  • The user passes a list of references to content in the repository
  • Hadoop downloads each item into HBase, returning an HBase URI
  • Hadoop processes the item as required, using the repository API to post any results back
  • The item remains cached in HBase until it is needed again
  • Old items get bumped out if space runs low

• This would suit the BL’s Digital Library System
  • The storage architecture is decoupled from processing
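
A sketch of the Hadoop side of this proposal, assuming one repository reference (URL) per input line. The table name, column family and hbase:// URI scheme are invented for illustration, and the HBase calls follow the circa-2011 client API.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for Proposal 1: caches each referenced item in
// HBase and emits an HBase-style URI for the processing stage.
public class CacheIntoHBaseMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private HTable cache;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        cache = new HTable(conf, "content-cache"); // assumed table name
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String ref = value.toString().trim();

        // Download the item from the repository:
        InputStream in = new URL(ref).openStream();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        in.close();

        // Cache it in HBase, keyed by the original reference:
        Put put = new Put(Bytes.toBytes(ref));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("bytes"), buf.toByteArray());
        cache.put(put);

        // Emit an HBase-style URI for later processing:
        context.write(value, new Text("hbase://content-cache/" + ref));
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        cache.close();
    }
}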

PROPOSAL 2: Preservation Service Farm

• The repository drives the workflow, but needs to invoke services on lots of content.

• Underlying tools may have varied OS needs.
  • The SCAPE Platform could spin up machines as needed, each providing RESTful endpoints that the repository can call
  • These could be simple services or full workflows

• The repository POSTs data to the cluster and pulls the result back again.

• Requires a complex all-in-one repository system, including workflow engines & triggers.

PROPOSAL 3: Run Repository On HBase

• A more radical but powerful option is to run the repository system on top of HBase.
  • Very scalable
  • Powerful content analysis and processing
  • But hard work if the repository expects a traditional database

Development Infrastructure
Working together

• TCC, calls, etc.

• Mailing lists:
  • [email protected]

• Wiki what we are working on, so it is clear which codebases we are improving
  • http://wiki.opf-labs.org/display/SP/
  • See the Developers’ Guide

• Build Manager (IM)

• System Manager and central cluster (IM)
