
DESCRIPTION

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation in the Digital Humanities. 2012 Leipzig eHumanities Seminar, 10 October 2012, Leipzig, Germany.

TRANSCRIPT

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation

Clemens Neudecker, KB (@cneudecker)
Zeki Mustafa Dogan, SUB-DL
Sven Schlarb, ÖNB (@SvenSchlarb)
Juan Garcés, GCDH (@juan_garces)

eHumanities Seminar 2012, University of Leipzig

10 October 2012

Idea

• Provide web-based versions of tools (web services)

• Package web services, data and documentation into ready-to-run “components” (encapsulation)

• Chain the components to create workflows via drag and drop

• Share and use workflows to re-run experiments and to demonstrate results

Background

• High degree of diversity in research topics, but also in the tools and frameworks used

• Technical resources should be easy to use, well documented, accessible from anywhere

• Prevent reinventing the wheel

Requirements

• Interoperability = connect different resources
• Flexibility = easy to deploy and adapt
• Modularity = allow different combinations of tools
• Usability = simple to use for non-technical users
• Re-usability = easy to share with others
• Scalability = apt for large-scale processing
• Sustainability = resources simple to preserve
• Transparency = tools evaluated separately
• Distributed development and deployment

IMPACT Interoperability Framework (IIF)

• Modules:
- Java Wrapper for command line tools
- Web Services (incl. format converters)
- Taverna Workflow Engine
- Client interfaces
- Repository connectors

Sources

https://github.com/impactcentre/interoperability-framework

IIF Command Line Wrapper

• Java project, built with Maven 2

• Creates a web service project from a given tool description (XML)

• The web service exposes SOAP & REST endpoints and a Java API

• Requirements: the tool must be callable from the command line, with no direct user interaction

IIF Web Services

• Web services are described by a WSDL

• Input/output data structures

• Data is referenced by URL

• Annotations

• Default values

• Endpoints: REST and SOAP
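As an illustration of the data-by-reference convention, here is a minimal Java sketch of calling such a service over REST; the endpoint name, parameter and response format are hypothetical, not the actual IIF API:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IifRestClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical REST endpoint; input data is passed by reference (URL),
        // following the IIF convention described above.
        String endpoint = "http://example.org/iif/ocr?inputUrl="
                + "http%3A%2F%2Fexample.org%2Fdata%2Fpage001.jp2";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .GET()
                .build();

        // The service is assumed to answer with a URL referencing the result.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Result reference: " + response.body());
    }
}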

IIF Workflows

• What is a workflow? (Yahoo Pipes, etc.)

• Different kinds of workflows: for a single command, application, chain of processes

• Main benefits: encapsulation and reuse

• Workflows as “components”: include link to WS endpoint, sample input data and documentation = ready-to-use resource

• Web 2.0 workflow registry: myExperiment

Why workflows?

• “In-silico experimentation”

• Good structuring of the experiment setup:
– Challenge/Research question
– Dataset definition
– Processing with algorithms
– Evaluation/Provenance
– Presentation of results

• All this can be modelled into a workflow

Integration into Taverna

• Web Services (SOAP and REST)

• Command line tools (SH and SSH)

• Beanshell scripts (can import Java libraries)

• R (statistics)

• Excel, CSV

• Additional service types can be added through dedicated plug-ins
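Beanshell activities in particular make small glue steps easy: a script is Java-like code whose input and output ports are bound to variables. A minimal sketch (the port names inputText and wordCount are illustrative, not from the slides):

// Beanshell activity sketch. In Taverna, the input port "inputText"
// arrives bound to a variable, and the output port "wordCount" is
// whatever the script assigns to that name.
String[] tokens = inputText.trim().split("\\s+");
wordCount = String.valueOf(tokens.length);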

Taverna flavours

• Workbench – local GUI client for Linux, Windows, OS X

• Command line tool – run workflows from the command line

• Server – Webapp with REST API and Java/Ruby client libs

• Web-Wf-Designer – JavaScript version for designing workflows in a browser
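The Server flavour can also be driven programmatically; a rough Java sketch against its REST API (base URL, media type and run-status protocol follow the Taverna Server 2 documentation, but should be checked against the installed version):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TavernaServerSketch {
    public static void main(String[] args) throws Exception {
        // Assumed install path of the Taverna Server webapp.
        String base = "http://localhost:8080/taverna-server/rest/runs";
        byte[] workflow = Files.readAllBytes(Path.of("workflow.t2flow"));

        HttpClient client = HttpClient.newHttpClient();
        // 1. Create a run by POSTing the t2flow definition.
        HttpRequest create = HttpRequest.newBuilder(URI.create(base))
                .header("Content-Type", "application/vnd.taverna.t2flow+xml")
                .POST(HttpRequest.BodyPublishers.ofByteArray(workflow))
                .build();
        HttpResponse<Void> created =
                client.send(create, HttpResponse.BodyHandlers.discarding());
        String runUri = created.headers().firstValue("Location").orElseThrow();

        // 2. Start the run by setting its status to Operating.
        HttpRequest start = HttpRequest.newBuilder(URI.create(runUri + "/status"))
                .header("Content-Type", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString("Operating"))
                .build();
        client.send(start, HttpResponse.BodyHandlers.discarding());
    }
}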

[Screenshots: Workbench, Webapp, Workflow registry]

Client interfaces

• Web service client: create a simple HTML form from a given web service description

• Taverna client: create a simple HTML form from a given Taverna workflow description

• Integration into production and presentation environments via iframes

[Screenshots: WS-client and T2-client]

Repositories

• Accessible via web service API:
– Fedora Commons
– WebDAV
– PRImA
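WebDAV in particular needs no dedicated client library, since reads are ordinary HTTP GETs; a minimal Java sketch, with a hypothetical repository URL:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class WebDavFetchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical WebDAV share; fetching a resource is a plain HTTP GET.
        URI resource = URI.create(
                "http://repo.example.org/webdav/Z119585409/00000001.jp2");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(resource).GET().build();
        client.send(request,
                HttpResponse.BodyHandlers.ofFile(Path.of("00000001.jp2")));
    }
}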

Architecture

Examples

• Use case 1: OCR (IMPACT)

• Start: Images (scanned documents)

• Processing: OCR, NLP, Evaluation

• Result: Full text, Entities, Sentiments

Examples

• Use case 2: Preservation (SCAPE)

• Start: Document collection preparation

• Processing: Hadoop, Hive

• Result: Statistics

Reading image metadata (find → Jp2PathCreator → HadoopStreamingExiftoolRead)

• find lists all JP2 images on the NAS (/NAS/Z119585409/00000001.jp2, /NAS/Z119585409/00000002.jp2, …) and Jp2PathCreator turns the listing into Hadoop input
• HadoopStreamingExiftoolRead reads the files from the NAS (1.4 GB / 1.2 GB) and extracts the image width, emitting (pageId, width) pairs such as Z119585409/00000001 2345
• 60,000 books / 24 million pages: ~5 h (find) + ~38 h (Hadoop job) = ~43 h
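The component name suggests a Hadoop Streaming job whose mapper shells out to exiftool; the slides do not show its internals, but a streaming-style mapper could look roughly like this (the page-id derivation and output format are assumptions):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ExiftoolWidthMapperSketch {
    public static void main(String[] args) throws Exception {
        // Hadoop Streaming feeds one input line (a JP2 path) per record on stdin.
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        String path;
        while ((path = stdin.readLine()) != null) {
            // exiftool -s -s -s -ImageWidth prints just the width value.
            Process p = new ProcessBuilder(
                    "exiftool", "-s", "-s", "-s", "-ImageWidth", path).start();
            String width;
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                width = out.readLine();
            }
            p.waitFor();
            // Derive a page id like Z119585409/00000001 from the NAS path.
            String id = path.replaceFirst("^/NAS/", "").replaceFirst("\\.jp2$", "");
            System.out.println(id + "\t" + width);
        }
    }
}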

Sequence file creation (find → HtmlPathCreator → SequenceFileCreator)

• find lists all hOCR HTML files on the NAS (/NAS/Z119585409/00000707.html, /NAS/Z119585409/00000708.html, …) and HtmlPathCreator turns the listing into Hadoop input
• SequenceFileCreator packs the many small HTML files into SequenceFiles keyed by page ID (Z119585409/00000707, …); reading the files from the NAS turns 1.4 GB of path lists into ~997 GB (uncompressed) of sequence files
• 60,000 books / 24 million pages: ~5 h (find) + ~24 h = ~29 h
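A sketch of the SequenceFileCreator idea with the standard Hadoop API: many small hOCR files are packed into one SequenceFile keyed by page id (file names follow the slides; the component's actual code is not shown):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileCreatorSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("hocr-pages.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // One small HTML file per page; key = page id, value = file content.
            String pagePath = "/NAS/Z119585409/00000707.html";
            String html = new String(Files.readAllBytes(Paths.get(pagePath)),
                    StandardCharsets.UTF_8);
            writer.append(new Text("Z119585409/00000707"), new Text(html));
        }
    }
}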

HTML parsing (HadoopAvBlockWidthMapReduce)

• MapReduce job reading the SequenceFile and writing a text file
• Map: parse the hOCR of each page and emit one record per text block width, e.g. Z119585409/00000001 → 2100, 2200, 2300, 2400
• Reduce: average the block widths per page, e.g. Z119585409/00000001 → 2250
• 60,000 books / 24 million pages: ~6 h
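The reduce side of such a job is a few lines with the standard Hadoop API; a sketch, assuming the map phase has already parsed the hOCR and emitted (pageId, blockWidth) pairs:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer sketch: average all block widths emitted for one page id.
public class AvBlockWidthReducerSketch
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pageId, Iterable<IntWritable> widths, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0;
        long n = 0;
        for (IntWritable w : widths) {
            sum += w.get();
            n++;
        }
        // e.g. 2100, 2200, 2300, 2400 -> 2250, matching the slide's example.
        ctx.write(pageId, new IntWritable((int) (sum / n)));
    }
}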

HiveLoadExifData & HiveLoadHocrData

• Load the outputs of both jobs into Hive tables:

CREATE TABLE jp2width(jid STRING, jwidth INT)
CREATE TABLE htmlwidth(hid STRING, hwidth INT)

jp2width (image widths from exiftool):

jid                   jwidth
Z119585409/00000001   2250
Z119585409/00000002   2150
Z119585409/00000003   2125
Z119585409/00000004   2125
Z119585409/00000005   2250

htmlwidth (average block widths from hOCR):

hid                   hwidth
Z119585409/00000001   1870
Z119585409/00000002   2100
Z119585409/00000003   2015
Z119585409/00000004   1350
Z119585409/00000005   1700
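These load steps can be scripted, for example via Hive's JDBC driver; a sketch assuming a HiveServer2 at localhost and the tab-separated job outputs already on HDFS (URL and paths are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveLoadSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port and paths are illustrative.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE jp2width(jid STRING, jwidth INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            stmt.execute("CREATE TABLE htmlwidth(hid STRING, hwidth INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            // Load the tab-separated outputs of the two Hadoop jobs.
            stmt.execute("LOAD DATA INPATH '/output/exif-width' INTO TABLE jp2width");
            stmt.execute("LOAD DATA INPATH '/output/hocr-width' INTO TABLE htmlwidth");
        }
    }
}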

Analytic Queries (HiveSelect)

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

jid                   jwidth   hwidth
Z119585409/00000001   2250     1870
Z119585409/00000002   2150     2100
Z119585409/00000003   2125     2015
Z119585409/00000004   2125     1350
Z119585409/00000005   2250     1700

• 60,000 books / 24 million pages: ~6 h

Examples

• Use case 3: Curation (GDZ)

• Start: Get documents from repository

• Processing: Enrichment (OCR, Entities, GeoNames)

• Result: Online presentation

ROPEN (Resource Oriented Presentation ENvironment)

Scalability

• Multiple options:

- Service parallelization

- Cloud

- Grid

- Hadoop

Compatibility

• Taverna UIMA

• Taverna Galaxy

• Taverna Kepler

• Taverna WebLicht

• Taverna SEASR

But…

• Multi-layered approach increases complexity (debugging, maintenance)

• Diverse set of endpoints (OS, CPU, etc.)

• Multiple dependencies

• Shared responsibilities

• Authentication & Authorization

• Error handling / Fail-over / Monitoring

Demo(s)

Discussion

• Potential and use cases for DH?

• Tools/features to make available?

• Questions, comments or remarks?

Thank you!
