Collaborative Workflow Development and Experimentation in the Digital Humanities
A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation in the Digital Humanities. 2012 Leipzig eHumanities Seminar, 10 October 2012, Leipzig, Germany.
![Page 1: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/1.jpg)
A Service-Oriented Architecture for Collaborative Workflow
Development and Experimentation
Clemens Neudecker, KB @cneudecker
Zeki Mustafa Dogan, SUB-DL
Sven Schlarb, ÖNB @SvenSchlarb
Juan Garcés, GCDH @juan_garces
eHumanities Seminar 2012
University of Leipzig
10-10-2012
![Page 2: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/2.jpg)
Idea
• Provide web-based versions of tools (web services)
• Package web services, data and documentation into ready-to-run “components” (encapsulation)
• Chain the components to create workflows via drag-and-drop operation
• Share and use workflows to re-run experiments and to demonstrate results
![Page 3: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/3.jpg)
Background
• High degree of diversity in research topics, but also tools and frameworks being used
• Technical resources should be easy to use, well documented, accessible from anywhere
• Prevent re-inventing of the wheel
![Page 4: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/4.jpg)
Requirements
• Interoperability = connect different resources
• Flexibility = easy to deploy and adapt
• Modularity = allow different combinations of tools
• Usability = simple to use for non-technical users
• Re-usability = easy to share with others
• Scalability = apt for large-scale processing
• Sustainability = resources simple to preserve
• Transparency = tools evaluated separately
• Distributed development and deployment
![Page 5: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/5.jpg)
IMPACT Interoperability Framework (IIF)
• Modules:
- Java Wrapper for command line tools
- Web Services (incl. format converters)
- Taverna Workflow Engine
- Client interfaces
- Repository connectors
![Page 6: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/6.jpg)
Sources
https://github.com/impactcentre/interoperability-framework
![Page 7: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/7.jpg)
IIF Command Line Wrapper
• Java project, built with Maven 2
• Creates a web service project from a given tool description (XML)
• Web service exposes SOAP & REST endpoints and Java API interface
• Requirements: command line call, no direct user interaction
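As a rough illustration of the wrapper idea, the sketch below turns a tool description into a command line call. The XML layout, the element names (`tool`, `command`, `param`) and the `build_command` helper are assumptions made for this example, not the IIF's actual tool description schema, and the wrapper itself is a Java project rather than Python.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical tool description; the real IIF schema may look different.
TOOL_XML = """
<tool name="convert-image">
  <command>convert ${input} -resize ${size} ${output}</command>
  <param name="size" default="50%"/>
</tool>
"""

def build_command(tool_xml, values):
    """Fill the command template with the given values, falling back to defaults."""
    root = ET.fromstring(tool_xml)
    template = root.findtext("command")
    # Collect the default values declared in the description.
    params = {p.get("name"): p.get("default") for p in root.findall("param")}
    params.update(values)
    for name, value in params.items():
        # Quote each value so it stays a single argument.
        template = template.replace("${" + name + "}", shlex.quote(value))
    return shlex.split(template)

cmd = build_command(TOOL_XML, {"input": "in.tif", "output": "out.png"})
print(cmd)
```

The resulting argument list could be handed to a process runner without any user interaction, which is the requirement named on the slide.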
![Page 8: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/8.jpg)
![Page 9: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/9.jpg)
![Page 10: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/10.jpg)
![Page 11: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/11.jpg)
![Page 12: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/12.jpg)
IIF Web Services
• Web services are described by a WSDL
• Input/output data structures
• Data is referenced by URL
• Annotations
• Default values
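A minimal sketch of "data is referenced by URL" combined with default values, assuming a REST-style GET endpoint. The endpoint, the parameter names (`inputUrl`, `language`, `outputFormat`) and the defaults are placeholders invented for this example, not taken from an actual IIF service description.

```python
from urllib.parse import urlencode

# Hypothetical defaults that a service description might declare.
DEFAULTS = {"language": "eng", "outputFormat": "txt"}

def build_request_url(endpoint, input_url, **options):
    """Build a GET request URL; the input data is passed by reference (a URL)."""
    params = dict(DEFAULTS)
    params.update(options)          # caller-supplied values override defaults
    params["inputUrl"] = input_url  # reference to the data, not the data itself
    return endpoint + "?" + urlencode(sorted(params.items()))

url = build_request_url(
    "http://example.org/services/ocr",    # placeholder endpoint
    "http://example.org/data/page1.tif",  # placeholder input reference
    language="deu",
)
print(url)
```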
![Page 13: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/13.jpg)
REST
![Page 14: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/14.jpg)
SOAP
![Page 15: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/15.jpg)
IIF Workflows
• What is a workflow? (Yahoo Pipes, etc.)
• Different kinds of workflows: for a single command, application, chain of processes
• Main benefits: encapsulation and reuse
• Workflows as “components”: include link to WS endpoint, sample input data and documentation = ready-to-use resource
• Web 2.0 workflow registry: myExperiment
![Page 16: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/16.jpg)
![Page 17: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/17.jpg)
Why workflows?
• “In-silico experimentation”
• Good structuring of experiment setup:
– Challenge/Research question
– Dataset definition
– Processing with algorithms
– Evaluation/Provenance
– Presentation of results
• All this can be modelled into a workflow
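The experiment structure above can be sketched as a chain of steps that records what ran, as a simple form of provenance. This is plain Python, not Taverna; the step functions and the sample input are hypothetical stand-ins.

```python
from functools import reduce

def run_workflow(steps, data):
    """Pass data through each step in order and log each step's result."""
    log = []

    def apply(value, step):
        result = step(value)
        log.append((step.__name__, result))
        return result

    return reduce(apply, steps, data), log

# Hypothetical stand-ins for dataset definition, processing and evaluation.
def select_dataset(pages):
    return [p for p in pages if p.strip()]

def process(pages):
    return [p.lower() for p in pages]

def evaluate(pages):
    return {"pages": len(pages), "words": sum(len(p.split()) for p in pages)}

result, provenance = run_workflow(
    [select_dataset, process, evaluate],
    ["Erste Seite", "", "Zweite Seite hier"],
)
print(result)
```

Because the steps are an ordered list, the same experiment can be re-run on another dataset or with one step swapped out.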
![Page 18: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/18.jpg)
Integration into Taverna
• Web Services (SOAP and REST)
• Command line tools (SH and SSH)
• Beanshells (can import Java libraries)
• R (statistics)
• Excel, CSV
• Additional service types can be added through dedicated plug-ins
![Page 19: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/19.jpg)
Taverna flavours
• Workbench – local GUI client for Linux, Windows, OS X
• Command line tool – run workflows from the command line
• Server – Webapp with REST API and Java/Ruby client libraries
• Web-Wf-Designer – JavaScript version for designing workflows in a browser
![Page 20: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/20.jpg)
Workbench
![Page 21: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/21.jpg)
Webapp
![Page 22: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/22.jpg)
Workflow registry
![Page 23: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/23.jpg)
Client interfaces
• Web service client: create a simple HTML form from a given web service description
• Taverna client: create a simple HTML form from a given Taverna workflow description
• Integration into production and presentation environments via iframes
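A minimal sketch of the client idea: generating a simple HTML form from a list of service parameters. The `render_form` helper, the endpoint and the parameter names are assumptions for illustration; the actual clients derive the fields from a web service or Taverna workflow description.

```python
from html import escape

def render_form(action_url, params):
    """Render a minimal HTML form with one text input per (name, default) pair."""
    fields = "\n".join(
        f'  <label>{escape(name)} '
        f'<input name="{escape(name)}" value="{escape(default)}"></label>'
        for name, default in params
    )
    return (
        f'<form action="{escape(action_url)}" method="post">\n'
        f"{fields}\n"
        '  <button type="submit">Run</button>\n'
        "</form>"
    )

form = render_form(
    "http://example.org/services/ocr",  # placeholder endpoint
    [("inputUrl", ""), ("language", "eng")],
)
print(form)
```

A page serving this form could then be embedded in another site with an iframe.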
![Page 24: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/24.jpg)
WS-client
![Page 25: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/25.jpg)
T2-client
![Page 26: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/26.jpg)
Repositories
• Accessible via web service API
– Fedora Commons
– WebDAV
– PRImA
![Page 27: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/27.jpg)
Architecture
![Page 28: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/28.jpg)
Examples
• Use case 1: OCR (IMPACT)
• Start: Images (scanned documents)
• Processing: OCR, NLP, Evaluation
• Result: Full text, Entities, Sentiments
![Page 29: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/29.jpg)
![Page 30: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/30.jpg)
![Page 31: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/31.jpg)
![Page 32: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/32.jpg)
![Page 33: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/33.jpg)
Examples
• Use case 2: Preservation (SCAPE)
• Start: Document collection preparation
• Processing: Hadoop, Hive
• Result: Statistics
![Page 34: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/34.jpg)
![Page 35: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/35.jpg)
![Page 36: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/36.jpg)
Reading image metadata
• Components: find, Jp2PathCreator, HadoopStreamingExiftoolRead
• Input: JP2 files read from NAS, listed by path:
/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
• Output: one line per page with identifier and value:
Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
• Data volumes: 1.4 GB, 1.2 GB
• Runtime: ~5 h + ~38 h = ~43 h for 60,000 books (24 million pages)
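The page identifiers in this use case (e.g. Z119585409/00000001) mirror the file paths on the NAS (e.g. /NAS/Z119585409/00000001.jp2). A minimal sketch of that mapping follows. The `page_id` helper is hypothetical, since the slides do not show how the actual components derive the identifier.

```python
from pathlib import PurePosixPath

def page_id(path):
    """Turn /NAS/<book>/<page>.<ext> into '<book>/<page>'."""
    p = PurePosixPath(path)
    return f"{p.parent.name}/{p.stem}"

print(page_id("/NAS/Z119585409/00000001.jp2"))
```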
![Page 37: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/37.jpg)
![Page 38: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/38.jpg)
Sequence file creation
• Components: find, HtmlPathCreator, SequenceFileCreator
• Input: HTML files read from NAS, listed by path:
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
• Page identifiers: Z119585409/00000707, Z119585409/00000708, Z119585409/00000709, …
• Data volumes: 1.4 GB, 997 GB (uncompressed)
• Runtime: ~5 h + ~24 h = ~29 h for 60,000 books (24 million pages)
![Page 39: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/39.jpg)
![Page 40: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/40.jpg)
HTML parsing
• Component: HadoopAvBlockWidthMapReduce
• Data formats: SequenceFile (input), Textfile (output)
• Map: emits several values per page identifier, e.g. for Z119585409/00000001: 2100, 2200, 2300, 2400
• Reduce: one averaged value per page:
Z119585409/00000001 2250
Z119585409/00000002 2250
Z119585409/00000003 2250
Z119585409/00000004 2250
Z119585409/00000005 2250
• Runtime: ~6 h for 60,000 books (24 million pages)
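The map and reduce steps of this slide can be sketched in plain Python as follows. This is not the Hadoop job itself. It assumes, based on the component name HadoopAvBlockWidthMapReduce and the sample values, that the map step emits one width per text block and the reduce step averages them per page.

```python
from collections import defaultdict

def map_page(page_id, block_widths):
    """Map step: emit one (page_id, width) pair per text block."""
    for width in block_widths:
        yield page_id, width

def reduce_average(pairs):
    """Reduce step: average all widths that share a page_id."""
    grouped = defaultdict(list)
    for key, width in pairs:
        grouped[key].append(width)
    return {key: sum(ws) / len(ws) for key, ws in grouped.items()}

# Sample values taken from the slide.
pages = {
    "Z119585409/00000001": [2100, 2200, 2300, 2400],
    "Z119585409/00000002": [2100, 2200, 2300, 2400],
}
pairs = [pair for pid, widths in pages.items() for pair in map_page(pid, widths)]
averages = reduce_average(pairs)
print(averages)
```

For the sample values, (2100 + 2200 + 2300 + 2400) / 4 = 2250, which matches the reduce output shown on the slide.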
![Page 41: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/41.jpg)
![Page 42: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/42.jpg)
Analytic Queries
• Components: HiveLoadExifData & HiveLoadHocrData
• Runtime: ~6 h for 60,000 books (24 million pages)

jp2width

| jid | jwidth |
| --- | --- |
| Z119585409/00000001 | 2250 |
| Z119585409/00000002 | 2150 |
| Z119585409/00000003 | 2125 |
| Z119585409/00000004 | 2125 |
| Z119585409/00000005 | 2250 |

htmlwidth

| hid | hwidth |
| --- | --- |
| Z119585409/00000001 | 1870 |
| Z119585409/00000002 | 2100 |
| Z119585409/00000003 | 2015 |
| Z119585409/00000004 | 1350 |
| Z119585409/00000005 | 1700 |

CREATE TABLE jp2width(jid STRING, jwidth INT)
CREATE TABLE htmlwidth(hid STRING, hwidth INT)
![Page 43: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/43.jpg)
Analytic Queries
• Component: HiveSelect
• Runtime: ~6 h for 60,000 books (24 million pages)
• Input tables: jp2width (jid, jwidth) and htmlwidth (hid, hwidth)

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

| jid | jwidth | hwidth |
| --- | --- | --- |
| Z119585409/00000001 | 2250 | 1870 |
| Z119585409/00000002 | 2150 | 2100 |
| Z119585409/00000003 | 2125 | 2015 |
| Z119585409/00000004 | 2125 | 1350 |
| Z119585409/00000005 | 2250 | 1700 |
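The join can be tried out on the sample rows with an in-memory SQLite database. SQLite stands in for Hive here, so the column types are TEXT/INTEGER instead of STRING/INT. The key column of jp2width is named jid, as the select statement requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width(jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth(hid TEXT, hwidth INTEGER)")

# Sample rows taken from the slide.
jp2_rows = [
    ("Z119585409/00000001", 2250), ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125), ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
]
html_rows = [
    ("Z119585409/00000001", 1870), ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015), ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
]
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2_rows)
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html_rows)

# Same join as on the slide, with an ORDER BY added for stable output.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
).fetchall()
for row in rows:
    print(row)
```

Each result row pairs the image width from the JP2 metadata with the width derived from the HTML for the same page, so the two can be compared.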
![Page 44: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/44.jpg)
Examples
• Use case 3: Curation (GDZ)
• Start: Get documents from repository
• Processing: Enrichment (OCR, Entities, GeoNames)
• Result: Online presentation
![Page 45: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/45.jpg)
![Page 46: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/46.jpg)
![Page 47: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/47.jpg)
![Page 48: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/48.jpg)
![Page 49: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/49.jpg)
![Page 50: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/50.jpg)
![Page 51: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/51.jpg)
![Page 52: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/52.jpg)
ROPEN (= Resource Oriented Presentation ENvironment)
![Page 53: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/53.jpg)
Scalability
• Multiple options:
- Service parallelization
- Cloud
- Grid
- Hadoop
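Of these options, service parallelization is the simplest to sketch: several service calls run concurrently instead of one after another. The `call_service` function below is a stand-in that does no network access; a real client would issue an HTTP request to a service endpoint.

```python
from concurrent.futures import ThreadPoolExecutor

def call_service(page_id):
    """Stand-in for a web service call; returns the id and a dummy result."""
    return page_id, len(page_id)

page_ids = [f"Z119585409/{n:08d}" for n in range(1, 6)]

# Up to four calls run at the same time; map() keeps the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(call_service, page_ids))
print(results)
```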
![Page 54: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/54.jpg)
Compatibility
• Taverna ↔ UIMA
• Taverna ↔ Galaxy
• Taverna ↔ Kepler
• Taverna ↔ WebLicht
• Taverna ↔ SEASR
![Page 55: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/55.jpg)
But…
• Multi-layered approach increases complexity (debugging, maintenance)
• Diverse set of endpoints (OS, CPU, etc.)
• Multiple dependencies
• Shared responsibilities
• Authentication & Authorization
• Error handling / Fail-over / Monitoring
![Page 56: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/56.jpg)
Demo(s)
![Page 57: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/57.jpg)
Discussion
• Potential/use cases DH?
• Tools/features to make available?
• Questions, comments or remarks?
![Page 58: Collaborative Workflow Development and Experimentation in the Digital Humanities](https://reader038.vdocument.in/reader038/viewer/2022110308/557dcb90d8b42a93718b48d1/html5/thumbnails/58.jpg)
Thank you!