
DESCRIPTION

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation in the Digital Humanities. 2012 Leipzig eHumanities Seminar, 10 October 2012, Leipzig, Germany.

TRANSCRIPT

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation

Clemens Neudecker, KB (@cneudecker)
Zeki Mustafa Dogan, SUB-DL
Sven Schlarb, ÖNB (@SvenSchlarb)
Juan Garcés, GCDH (@juan_garces)

eHumanities Seminar 2012, University of Leipzig

10 October 2012

Idea

• Provide web-based versions of tools (web services)

• Package web services, data and documentation into ready-to-run “components” (encapsulation)

• Chain the components to create workflows via drag and drop

• Share and use workflows to re-run experiments and to demonstrate results

Background

• High degree of diversity in research topics, but also in the tools and frameworks used

• Technical resources should be easy to use, well documented, accessible from anywhere

• Prevent reinventing the wheel

Requirements

• Interoperability = connect different resources
• Flexibility = easy to deploy and adapt
• Modularity = allow different combinations of tools
• Usability = simple to use for non-technical users
• Re-usability = easy to share with others
• Scalability = apt for large-scale processing
• Sustainability = resources simple to preserve
• Transparency = tools evaluated separately
• Distributed development and deployment

IMPACT Interoperability Framework (IIF)

• Modules:
- Java Wrapper for command line tools
- Web Services (incl. format converters)
- Taverna Workflow Engine
- Client interfaces
- Repository connectors

Sources

https://github.com/impactcentre/interoperability-framework

IIF Command Line Wrapper

• Java project, built with Maven 2

• Creates a web service project from a given tool description (XML)

• The web service exposes SOAP & REST endpoints and a Java API

• Requirements: the tool must be callable from the command line, with no direct user interaction

IIF Web Services

• Web services are described by a WSDL

• Input/output data structures

• Data is referenced by URL

• Annotations

• Default values

• Endpoints: REST and SOAP
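As an illustration of the data-by-reference convention, here is a minimal Java sketch of calling such a service over REST; the endpoint name, parameter and response format are hypothetical, not the actual IIF API:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IifRestClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical REST endpoint; input data is passed by reference (URL),
        // following the IIF convention described above.
        String endpoint = "http://example.org/iif/ocr?inputUrl="
                + "http%3A%2F%2Fexample.org%2Fdata%2Fpage001.jp2";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .GET()
                .build();

        // The service is assumed to answer with a URL referencing the result.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Result reference: " + response.body());
    }
}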

IIF Workflows

• What is a workflow? (Yahoo Pipes, etc.)

• Different kinds of workflows: for a single command, application, chain of processes

• Main benefits: encapsulation and reuse

• Workflows as “components”: include link to WS endpoint, sample input data and documentation = ready-to-use resource

• Web 2.0 workflow registry: myExperiment

Why workflows?

• “In-silico experimentation”

• Good structuring of the experiment setup:
– Challenge/Research question
– Dataset definition
– Processing with algorithms
– Evaluation/Provenance
– Presentation of results

• All this can be modelled into a workflow

Integration into Taverna

• Web Services (SOAP and REST)

• Command line tools (SH and SSH)

• Beanshell scripts (can import Java libraries)

• R (statistics)

• Excel, CSV

• Additional service types can be added through dedicated plug-ins
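Beanshell activities in particular make small glue steps easy: a script is Java-like code whose input and output ports are bound to variables. A minimal sketch (the port names inputText and wordCount are illustrative, not from the slides):

// Beanshell activity sketch. In Taverna, the input port "inputText"
// arrives bound to a variable, and the output port "wordCount" is
// whatever the script assigns to that name.
String[] tokens = inputText.trim().split("\\s+");
wordCount = String.valueOf(tokens.length);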

Taverna flavours

• Workbench – local GUI client for Linux, Windows, OS X

• Command line tool – run workflows from the command line

• Server – Webapp with REST API and Java/Ruby client libs

• Web-Wf-Designer – JavaScript version for designing workflows in a browser
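The Server flavour can also be driven programmatically; a rough Java sketch against its REST API (base URL, media type and run-status protocol follow the Taverna Server 2 documentation, but should be checked against the installed version):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TavernaServerSketch {
    public static void main(String[] args) throws Exception {
        // Assumed install path of the Taverna Server webapp.
        String base = "http://localhost:8080/taverna-server/rest/runs";
        byte[] workflow = Files.readAllBytes(Path.of("workflow.t2flow"));

        HttpClient client = HttpClient.newHttpClient();
        // 1. Create a run by POSTing the t2flow definition.
        HttpRequest create = HttpRequest.newBuilder(URI.create(base))
                .header("Content-Type", "application/vnd.taverna.t2flow+xml")
                .POST(HttpRequest.BodyPublishers.ofByteArray(workflow))
                .build();
        HttpResponse<Void> created =
                client.send(create, HttpResponse.BodyHandlers.discarding());
        String runUri = created.headers().firstValue("Location").orElseThrow();

        // 2. Start the run by setting its status to Operating.
        HttpRequest start = HttpRequest.newBuilder(URI.create(runUri + "/status"))
                .header("Content-Type", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString("Operating"))
                .build();
        client.send(start, HttpResponse.BodyHandlers.discarding());
    }
}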

[Screenshots: Workbench, Webapp, Workflow registry]

Client interfaces

• Web service client: create a simple HTML form from a given web service description

• Taverna client: create a simple HTML form from a given Taverna workflow description

• Integration into production and presentation environments via iframes

[Screenshots: WS-client and T2-client]

Repositories

• Accessible via web service API:
– Fedora Commons
– WebDAV
– PRImA
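WebDAV in particular needs no dedicated client library, since reads are ordinary HTTP GETs; a minimal Java sketch, with a hypothetical repository URL:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class WebDavFetchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical WebDAV share; fetching a resource is a plain HTTP GET.
        URI resource = URI.create(
                "http://repo.example.org/webdav/Z119585409/00000001.jp2");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(resource).GET().build();
        client.send(request,
                HttpResponse.BodyHandlers.ofFile(Path.of("00000001.jp2")));
    }
}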

Architecture

Examples

• Use case 1: OCR (IMPACT)

• Start: Images (scanned documents)

• Processing: OCR, NLP, Evaluation

• Result: Full text, Entities, Sentiments

Examples

• Use case 2: Preservation (SCAPE)

• Start: Document collection preparation

• Processing: Hadoop, Hive

• Result: Statistics

Reading image metadata (find → Jp2PathCreator → HadoopStreamingExiftoolRead)

• find lists all JP2 images on the NAS (/NAS/Z119585409/00000001.jp2, /NAS/Z119585409/00000002.jp2, …) and Jp2PathCreator turns the listing into Hadoop input
• HadoopStreamingExiftoolRead reads the files from the NAS (1.4 GB / 1.2 GB) and extracts the image width, emitting (pageId, width) pairs such as Z119585409/00000001 2345
• 60,000 books / 24 million pages: ~5 h (find) + ~38 h (Hadoop job) = ~43 h
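The component name suggests a Hadoop Streaming job whose mapper shells out to exiftool; the slides do not show its internals, but a streaming-style mapper could look roughly like this (the page-id derivation and output format are assumptions):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ExiftoolWidthMapperSketch {
    public static void main(String[] args) throws Exception {
        // Hadoop Streaming feeds one input line (a JP2 path) per record on stdin.
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        String path;
        while ((path = stdin.readLine()) != null) {
            // exiftool -s -s -s -ImageWidth prints just the width value.
            Process p = new ProcessBuilder(
                    "exiftool", "-s", "-s", "-s", "-ImageWidth", path).start();
            String width;
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                width = out.readLine();
            }
            p.waitFor();
            // Derive a page id like Z119585409/00000001 from the NAS path.
            String id = path.replaceFirst("^/NAS/", "").replaceFirst("\\.jp2$", "");
            System.out.println(id + "\t" + width);
        }
    }
}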

Sequence file creation (find → HtmlPathCreator → SequenceFileCreator)

• find lists all hOCR HTML files on the NAS (/NAS/Z119585409/00000707.html, /NAS/Z119585409/00000708.html, …) and HtmlPathCreator turns the listing into Hadoop input
• SequenceFileCreator packs the many small HTML files into SequenceFiles keyed by page ID (Z119585409/00000707, …); reading the files from the NAS turns 1.4 GB of path lists into ~997 GB (uncompressed) of sequence files
• 60,000 books / 24 million pages: ~5 h (find) + ~24 h = ~29 h
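A sketch of the SequenceFileCreator idea with the standard Hadoop API: many small hOCR files are packed into one SequenceFile keyed by page id (file names follow the slides; the component's actual code is not shown):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileCreatorSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("hocr-pages.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // One small HTML file per page; key = page id, value = file content.
            String pagePath = "/NAS/Z119585409/00000707.html";
            String html = new String(Files.readAllBytes(Paths.get(pagePath)),
                    StandardCharsets.UTF_8);
            writer.append(new Text("Z119585409/00000707"), new Text(html));
        }
    }
}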

HTML parsing (HadoopAvBlockWidthMapReduce)

• MapReduce job reading the SequenceFile and writing a text file
• Map: parse the hOCR of each page and emit one record per text block width, e.g. Z119585409/00000001 → 2100, 2200, 2300, 2400
• Reduce: average the block widths per page, e.g. Z119585409/00000001 → 2250
• 60,000 books / 24 million pages: ~6 h
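The reduce side of such a job is a few lines with the standard Hadoop API; a sketch, assuming the map phase has already parsed the hOCR and emitted (pageId, blockWidth) pairs:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer sketch: average all block widths emitted for one page id.
public class AvBlockWidthReducerSketch
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pageId, Iterable<IntWritable> widths, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0;
        long n = 0;
        for (IntWritable w : widths) {
            sum += w.get();
            n++;
        }
        // e.g. 2100, 2200, 2300, 2400 -> 2250, matching the slide's example.
        ctx.write(pageId, new IntWritable((int) (sum / n)));
    }
}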

HiveLoadExifData & HiveLoadHocrData

• Load the outputs of both jobs into Hive tables:

CREATE TABLE jp2width(jid STRING, jwidth INT)
CREATE TABLE htmlwidth(hid STRING, hwidth INT)

jp2width (image widths from exiftool):

jid                   jwidth
Z119585409/00000001   2250
Z119585409/00000002   2150
Z119585409/00000003   2125
Z119585409/00000004   2125
Z119585409/00000005   2250

htmlwidth (average block widths from hOCR):

hid                   hwidth
Z119585409/00000001   1870
Z119585409/00000002   2100
Z119585409/00000003   2015
Z119585409/00000004   1350
Z119585409/00000005   1700
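These load steps can be scripted, for example via Hive's JDBC driver; a sketch assuming a HiveServer2 at localhost and the tab-separated job outputs already on HDFS (URL and paths are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveLoadSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port and paths are illustrative.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE jp2width(jid STRING, jwidth INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            stmt.execute("CREATE TABLE htmlwidth(hid STRING, hwidth INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            // Load the tab-separated outputs of the two Hadoop jobs.
            stmt.execute("LOAD DATA INPATH '/output/exif-width' INTO TABLE jp2width");
            stmt.execute("LOAD DATA INPATH '/output/hocr-width' INTO TABLE htmlwidth");
        }
    }
}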

Analytic Queries (HiveSelect)

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

jid                   jwidth   hwidth
Z119585409/00000001   2250     1870
Z119585409/00000002   2150     2100
Z119585409/00000003   2125     2015
Z119585409/00000004   2125     1350
Z119585409/00000005   2250     1700

• 60,000 books / 24 million pages: ~6 h

Examples

• Use case 3: Curation (GDZ)

• Start: Get documents from repository

• Processing: Enrichment (OCR, Entities, GeoNames)

• Result: Online presentation

ROPEN (Resource Oriented Presentation ENvironment)

Scalability

• Multiple options:

- Service parallelization

- Cloud

- Grid

- Hadoop

Compatibility

• Taverna UIMA

• Taverna Galaxy

• Taverna Kepler

• Taverna WebLicht

• Taverna SEASR

But…

• Multi-layered approach increases complexity (debugging, maintenance)

• Diverse set of endpoints (OS, CPU, etc.)

• Multiple dependencies

• Shared responsibilities

• Authentication & Authorization

• Error handling / Fail-over / Monitoring

Demo(s)

Discussion

• Potential and use cases for DH?

• Tools/features to make available?

• Questions, comments or remarks?

Thank you!
