collaborative workflow development and experimentation in the digital humanities
Post on 15-Jun-2015
A Service-Oriented Architecture for Collaborative Workflow
Development and Experimentation
Clemens Neudecker, KB @cneudecker
Zeki Mustafa Dogan, SUB-DL
Sven Schlarb, ÖNB @SvenSchlarb
Juan Garcés, GCDH @juan_garces
eHumanities Seminar 2012
University of Leipzig
10-10-2012
Idea
• Provide web-based versions of tools (web services)
• Package web services, data and documentation into ready-to-run “components” (encapsulation)
• Chain the components to create workflows via drag-and-drop operation
• Share and use workflows to re-run experiments and to demonstrate results
Background
• High degree of diversity in research topics, but also in the tools and frameworks being used
• Technical resources should be easy to use, well documented and accessible from anywhere
• Avoid reinventing the wheel
Requirements
• Interoperability = connect different resources
• Flexibility = easy to deploy and adapt
• Modularity = allow different combinations of tools
• Usability = simple to use for non-technical users
• Re-usability = easy to share with others
• Scalability = apt for large-scale processing
• Sustainability = resources simple to preserve
• Transparency = tools evaluated separately
• Distributed development and deployment
Interoperability Framework (IIF)
• Modules:
- Java Wrapper for command line tools
- Web Services (incl. format converters)
- Taverna Workflow Engine
- Client interfaces
- Repository connectors
Sources
https://github.com/impactcentre/interoperability-framework
IIF Command Line Wrapper
• Java project, builds using Maven2
• Creates a web service project from a given tool description (XML)
• The web service exposes SOAP and REST endpoints and a Java API
• Requirements: command line call, no direct user interaction
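To make the wrapper idea concrete, here is a minimal sketch in Python (the IIF wrapper itself is a Java project). The `ToolSpec` class, its fields and the `${input}`/`${output}` placeholders are hypothetical and do not reflect the actual IIF tool description schema. The sketch only illustrates how a non-interactive command line call could be described declaratively and then executed on behalf of a service request.

```python
import shlex
import subprocess
from dataclasses import dataclass
from string import Template
from typing import List


@dataclass
class ToolSpec:
    """Hypothetical stand-in for an XML tool description."""
    name: str
    command: str  # command template with ${input} and ${output} placeholders


def build_command(spec: ToolSpec, input_path: str, output_path: str) -> List[str]:
    """Fill the template and return an argument list (no shell involved)."""
    # Split first and substitute afterwards, so that a path containing
    # spaces remains a single argument.
    return [
        Template(token).substitute(input=input_path, output=output_path)
        for token in shlex.split(spec.command)
    ]


def run_tool(spec: ToolSpec, input_path: str, output_path: str) -> int:
    """Run the tool without user interaction and return its exit code."""
    completed = subprocess.run(
        build_command(spec, input_path, output_path),
        stdin=subprocess.DEVNULL,  # the tool must not wait for user input
        check=False,
    )
    return completed.returncode


# Example description of a (hypothetical) copy tool.
spec = ToolSpec(name="copy", command="cp ${input} ${output}")
print(build_command(spec, "scan 01.tif", "out.tif"))
```

Closing standard input mirrors the stated requirement that a wrapped tool needs a plain command line call and no direct user interaction.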
IIF Web Services
• Web services are described by a WSDL
• Input/output data structures
• Data is referenced by URL
• Annotations
• Default values
REST
SOAP
IIF Workflows
• What is a workflow? (Yahoo Pipes, etc.)
• Different kinds of workflows: for a single command, application, chain of processes
• Main benefits: encapsulation, reuse
• Workflows as “components”: include link to WS endpoint, sample input data and documentation = ready-to-use resource
• Web 2.0 workflow registry: myExperiment
Why workflows?
• “In-silico experimentation”
• Good structuring of experiment setup:
– Challenge/Research question
– Dataset definition
– Processing with algorithms
– Evaluation/Provenance
– Presentation of results
• All this can be modelled as a workflow
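The idea of workflows as chains of ready-to-run components can be sketched in a few lines of Python. This is not Taverna's API: the `Component` class and `run_workflow` function are hypothetical names, and the two toy steps stand in for calls to real web service endpoints such as OCR or NLP services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Component:
    """A ready-to-run unit: a processing step plus documentation and sample input."""
    name: str
    run: Callable[[str], str]  # stands in for a call to a web service endpoint
    doc: str = ""
    sample_input: str = ""


def run_workflow(components: List[Component], data: str) -> str:
    """Pass the data through each component in order and print a simple trail."""
    for component in components:
        data = component.run(data)
        print(f"{component.name}: {data!r}")
    return data


# Toy steps; real components would wrap remote services instead.
workflow = [
    Component("normalise", lambda text: " ".join(text.split()), doc="Collapse whitespace"),
    Component("lowercase", str.lower, doc="Lower-case the text"),
]
result = run_workflow(workflow, "  Digital   Humanities ")
```

Because every component carries its own documentation and sample input, the same list can be shared and re-run by someone else, which is the point of publishing workflows in a registry.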
Integration into Taverna
• Web Services (SOAP and REST)
• Command line tools (SH and SSH)
• Beanshells (can import Java libraries)
• R (statistics)
• Excel, CSV
• Additional service types can be added through dedicated plug-ins
Taverna flavours
• Workbench – local GUI client for Linux, Windows, OS X
• Command line tool – run workflows from the command line
• Server – Webapp with REST API and Java/Ruby client libs
• Web-Wf-Designer – JavaScript version for designing workflows in a browser
Workbench
Webapp
Workflow registry
Client interfaces
• Web service client: create a simple HTML form from a given web service description
• Taverna client: create a simple HTML form from a given Taverna workflow description
• Integration into production and presentation environments via iframes
WS-client
T2-client
Repositories
• Accessible via web service API
– Fedora Commons
– WebDAV
– PRImA
Architecture
Examples
• Use case 1: OCR (IMPACT)
• Start: Images (scanned documents)
• Processing: OCR, NLP, Evaluation
• Result: Full text, Entities, Sentiments
Examples
• Use case 2: Preservation (SCAPE)
• Start: Document collection preparation
• Processing: Hadoop, Hive
• Result: Statistics
Reading image metadata
• Components: Jp2PathCreator (find, reading files from NAS), HadoopStreamingExiftoolRead
• Data sizes along the pipeline: 1.4 GB, 1.2 GB
• Runtime: ~5 h + ~38 h = ~43 h for 60,000 books (24 million pages)
• Sample data:
Input path (find)               Output
/NAS/Z119585409/00000001.jp2    Z119585409/00000001 2345
/NAS/Z119585409/00000002.jp2    Z119585409/00000002 2340
/NAS/Z119585409/00000003.jp2    Z119585409/00000003 2543
…
/NAS/Z117655409/00000001.jp2    Z117655409/00000001 2300
/NAS/Z117655409/00000002.jp2    Z117655409/00000002 2300
/NAS/Z117655409/00000003.jp2    Z117655409/00000003 2345
…
/NAS/Z119585987/00000001.jp2    Z119585987/00000001 2300
/NAS/Z119585987/00000002.jp2    Z119585987/00000002 2340
/NAS/Z119585987/00000003.jp2    Z119585987/00000003 2432
…
/NAS/Z119584539/00000001.jp2    Z119584539/00000001 5205
/NAS/Z119584539/00000002.jp2    Z119584539/00000002 2310
/NAS/Z119584539/00000003.jp2    Z119584539/00000003 2134
…
/NAS/Z119599879/00000001.jp2    Z119599879/00000001 2312
/NAS/Z119589879/00000002.jp2    Z119589879/00000002 2300
/NAS/Z119589879/00000003.jp2    Z119589879/00000003 2300
...
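The sample data pairs each file path with a page ID made up of the book directory and the page file name without its extension. A minimal sketch of that mapping follows; the function name `page_id` is an assumption, and the actual workflow components may derive the ID differently.

```python
from pathlib import PurePosixPath


def page_id(path: str) -> str:
    """Turn '/NAS/<book>/<page>.<ext>' into '<book>/<page>'."""
    p = PurePosixPath(path)
    return f"{p.parent.name}/{p.stem}"


print(page_id("/NAS/Z119585409/00000001.jp2"))
```

The same function works for the HTML files, since only the extension differs.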
Sequence file creation
• Components: HtmlPathCreator (find, reading files from NAS), SequenceFileCreator
• Data sizes along the pipeline: 1.4 GB, 997 GB (uncompressed)
• Runtime: ~5 h + ~24 h = ~29 h for 60,000 books (24 million pages)
• Input paths (sample):
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...
• Sequence file entries (sample):
Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712
HTML parsing
• Component: HadoopAvBlockWidthMapReduce (SequenceFile, Textfile)
• Runtime: ~6 h for 60,000 books (24 million pages)
• Sample data:
Page ID               Map                      Reduce
Z119585409/00000001   2100, 2200, 2300, 2400   2250
Z119585409/00000002   2100, 2200, 2300, 2400   2250
Z119585409/00000003   2100, 2200, 2300, 2400   2250
Z119585409/00000004   2100, 2200, 2300, 2400   2250
Z119585409/00000005   2100, 2200, 2300, 2400   2250
...
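The map and reduce steps of the HTML parsing job can be imitated in plain Python. This is a sketch under stated assumptions, not the HadoopAvBlockWidthMapReduce code: it assumes hOCR-style `bbox x0 y0 x1 y1` annotations in the HTML and takes a block's width to be `x1 - x0`. The HTML snippet is made up so that its block widths match the sample values 2100, 2200, 2300 and 2400.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed hOCR-style bounding box annotation: "bbox x0 y0 x1 y1".
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def map_page(page_id: str, html: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one (page ID, block width) pair per bounding box on the page."""
    for x0, _y0, x1, _y1 in BBOX.findall(html):
        yield page_id, int(x1) - int(x0)


def reduce_widths(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: group the widths by page ID and take their integer average."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for pid, width in pairs:
        grouped[pid].append(width)
    return {pid: sum(widths) // len(widths) for pid, widths in grouped.items()}


# Made-up input whose four block widths are 2100, 2200, 2300 and 2400.
html = (
    '<div class="ocr_carea" title="bbox 0 0 2100 50"></div>'
    '<div class="ocr_carea" title="bbox 100 60 2300 110"></div>'
    '<div class="ocr_carea" title="bbox 0 120 2300 170"></div>'
    '<div class="ocr_carea" title="bbox 50 180 2450 230"></div>'
)
averages = reduce_widths(map_page("Z119585409/00000001", html))
print(averages)
```

On a cluster the grouping by key is done by the framework between the two phases; here a dictionary plays that role.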
Analytic Queries
• Components: HiveLoadExifData & HiveLoadHocrData
• Runtime: ~6 h for 60,000 books (24 million pages)
• Table jp2width:
jid                   jwidth
Z119585409/00000001   2250
Z119585409/00000002   2150
Z119585409/00000003   2125
Z119585409/00000004   2125
Z119585409/00000005   2250
CREATE TABLE jp2width(jid STRING, jwidth INT)
• Table htmlwidth:
hid                   hwidth
Z119585409/00000001   1870
Z119585409/00000002   2100
Z119585409/00000003   2015
Z119585409/00000004   1350
Z119585409/00000005   1700
CREATE TABLE htmlwidth(hid STRING, hwidth INT)
Analytic Queries: HiveSelect
• Runtime: ~6 h for 60,000 books (24 million pages)
• Input tables: jp2width (jid, jwidth) and htmlwidth (hid, hwidth)
• Query:
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
• Result:
jid                   jwidth   hwidth
Z119585409/00000001   2250     1870
Z119585409/00000002   2150     2100
Z119585409/00000003   2125     2015
Z119585409/00000004   2125     1350
Z119585409/00000005   2250     1700
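The join can be tried out without a Hive installation. The sketch below swaps in Python's built-in `sqlite3` module for Hive, purely for illustration (SQLite's `TEXT` and `INTEGER` replace Hive's `STRING` and `INT`), and loads the five sample rows shown on the slides.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

# Sample rows from the slides: image width and HTML block width per page.
jp2_rows = [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
]
html_rows = [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
]
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2_rows)
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html_rows)

# The query from the slide, with ORDER BY added to make the output stable.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth "
    "FROM jp2width INNER JOIN htmlwidth ON jid = hid "
    "ORDER BY jid"
).fetchall()

for jid, jwidth, hwidth in rows:
    print(jid, jwidth, hwidth)
```

Each result row places the width read from the image next to the width derived from the HTML for the same page, so the two can be compared directly.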
Examples
• Use case 3: Curation (GDZ)
• Start: Get documents from repository
• Processing: Enrichment (OCR, Entities, GeoNames)
• Result: Online presentation
ROPEN (= Resource Oriented Presentation ENvironment)
Scalability
• Multiple options:
- Service parallelization
- Cloud
- Grid
- Hadoop
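Of these options, service parallelization is the simplest to illustrate: independent requests are sent to a web service concurrently instead of one after another. The sketch below uses Python's standard `concurrent.futures` module; `call_service` is a hypothetical placeholder that returns a fake result rather than issuing a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def call_service(page_id: str) -> str:
    """Placeholder for a blocking web service request; returns a fake result."""
    return f"processed {page_id}"


def process_in_parallel(page_ids: List[str], workers: int = 4) -> List[str]:
    """Send the requests concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call_service, page_ids))


results = process_in_parallel([f"Z119585409/{n:08d}" for n in range(1, 6)])
print(results[0])
```

Threads suit this case because the work is dominated by waiting for remote services; the other options listed above distribute the processing itself.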
Compatibility
• Taverna ↔ UIMA
• Taverna ↔ Galaxy
• Taverna ↔ Kepler
• Taverna ↔ Weblicht
• Taverna ↔ Seasr
But…
• Multi-layered approach increases complexity (debugging, maintenance)
• Diverse set of endpoints (OS, CPU, etc.)
• Multiple dependencies
• Shared responsibilities
• Authentication & Authorization
• Error handling / Fail-over / Monitoring
Demo(s)
Discussion
• Potential and use cases in the digital humanities?
• Tools/features to make available?
• Questions, comments or remarks?
Thank you!