Member of the Helmholtz Association (Mitglied der Helmholtz-Gemeinschaft)
A Concept of Generic Workspace for Big Data Processing in Humanities
2013-10-08 Jedrzej Rybicki, Benedikt von St. Vieth & Daniel Mallmann
DARIAH
Digital Research Infrastructure for the Arts and Humanities
DARIAH-DE, the German part of DARIAH,
supports Digital Humanities by providing
  digital methods and tools for research and education
  a platform that enables the interconnection of various disciplines
  a sustainable research infrastructure
Jülich Supercomputing Center
Involved in the process of building an infrastructure which is generic, easy to
use, and provides state-of-the-art processing and storage services.
2013-10-08 A Concept of Generic Workspace for Big Data Processing in Humanities Slide 2
DARIAH-DE Storage Service
For bit-preservation purposes, DARIAH-DE offers a Storage Service.
A researcher using this service can
  upload and download data objects using any HTTP client
  expect that everything is stored safely (achieved by replication across resources and computing centers)
The Storage Service
  provides an HTTP-based interface to storage resources
  uses a database to store basic metadata
  relies on iRODS as its storage backend
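Since the interface is plain HTTP, a client needs nothing beyond a standard library. The following is a minimal sketch only: the base URL, the object identifier, and the choice of PUT for uploads are assumptions made for illustration, as the slides do not specify the actual endpoints of the Storage Service.

```python
import urllib.request

# Hypothetical endpoint: the slides do not give the real service URL.
BASE_URL = "https://storage.example.org/objects"


def build_upload(object_id: str, payload: bytes) -> urllib.request.Request:
    """Prepare an HTTP PUT that would upload `payload` as one data object."""
    return urllib.request.Request(
        f"{BASE_URL}/{object_id}",
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def build_download(object_id: str) -> urllib.request.Request:
    """Prepare an HTTP GET that would fetch the data object again."""
    return urllib.request.Request(f"{BASE_URL}/{object_id}", method="GET")


upload = build_upload("letter-001", b"Dear Sir, ...")
download = build_download("letter-001")
# Sending is left out on purpose; against a real endpoint it would be:
#   with urllib.request.urlopen(upload) as response: ...
```

The requests are only constructed here, not sent, so the sketch runs without network access.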
Why iRODS?
The integrated Rule-Oriented Data System (iRODS) was chosen because it provides
1 the rule engine, which allows the behavior of the system to be modified
  actions, like acPostProcForPut, to react to system events
  rules, written in a native language, providing loops, if-statements, ...
  microservices, the smallest pieces of work
    many microservices are already available, used and chained together in rules
    written in C, so advanced users can extend iRODS functionality
2 storage drivers, abstracting various storage technologies
  drivers for file systems and some other storage providers are built into iRODS
  by implementing a common set of interactions (create, move, delete, ...), one can access any type of storage system
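The storage-driver idea can be sketched as an interface with one implementation per backend. This is an illustration of the abstraction only, not the iRODS driver API (iRODS drivers are written in C); the class and function names are hypothetical.

```python
import os
import shutil
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Common set of interactions every storage backend has to offer."""

    @abstractmethod
    def create(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def move(self, src: str, dst: str) -> None: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...


class LocalFileDriver(StorageDriver):
    """Driver backed by a directory on the local file system."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _abs(self, path: str) -> str:
        return os.path.join(self.root, path)

    def create(self, path: str, data: bytes) -> None:
        with open(self._abs(path), "wb") as handle:
            handle.write(data)

    def move(self, src: str, dst: str) -> None:
        shutil.move(self._abs(src), self._abs(dst))

    def delete(self, path: str) -> None:
        os.remove(self._abs(path))


def ingest(driver: StorageDriver, name: str, data: bytes) -> None:
    """Caller code depends only on the interface, not on the backend."""
    driver.create(name + ".part", data)
    driver.move(name + ".part", name)


root = tempfile.mkdtemp()
ingest(LocalFileDriver(root), "poem.txt", b"text")
stored = sorted(os.listdir(root))
```

A driver for another backend would implement the same three methods, and `ingest` would work unchanged.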
iRODS Example
    acPostProcForPut {
        ON($objPath like "*/sayhello.do") {
            sampleRule("Hello User!", *status);
        }
    }
    # *text = input, *status = output
    sampleRule(*text, *status) {
        msiWriteRodsLog("*text", *status);
    }
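For readers unfamiliar with the rule language, the control flow of this rule can be mimicked in a few lines of Python. This is a rough analogy, assuming that the `like` operator behaves like shell-style wildcard matching; the `RULES` table and the in-memory `log` are hypothetical stand-ins for the rule base and the server log.

```python
from fnmatch import fnmatchcase

log = []  # stand-in for the iRODS server log


def sample_rule(text: str) -> int:
    """Mimics sampleRule: write the text to the log, return a status."""
    log.append(text)
    return 0


# (pattern, handler) pairs play the role of ON(...) conditions.
RULES = [("*/sayhello.do", lambda: sample_rule("Hello User!"))]


def post_proc_for_put(obj_path: str) -> None:
    """Called after an upload; runs the first rule whose pattern matches."""
    for pattern, handler in RULES:
        if fnmatchcase(obj_path, pattern):
            handler()
            return


post_proc_for_put("/zone/home/alice/sayhello.do")  # matches, writes to the log
post_proc_for_put("/zone/home/alice/notes.txt")    # no rule matches
```

Only the first upload triggers the rule; the second path matches no pattern and is stored without side effects.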
DARIAH-DE Storage Service
Sample Repository ...
... Processing Result
Motivation
A researcher wants to extract information from the stored data objects
  she can download the data and process it locally
    this wastes her time, resources, and network bandwidth
    she may lack the processing power
  iRODS provides microservices which can be used for processing
    this requires C expertise
    and reconfiguration/recompilation of the server
Goal: Active Storage
Provide long-term storage with processing functionality in one place,
without increasing the complexity of the existing Storage Service.
Concept
The following decisions were made to extend the service
use the existing namespace to integrate a processing engine
utilize filesystem instructions (create, read, delete) to interface this engine
abstract the details of the underlying services
Generic Workspace
The researcher only has to interact with the namespace; everything is
provided in one place.
Big Data Processing
For the prototype we selected a processing engine that meets a few
requirements
  addressable through iRODS
  parallel processing of large amounts of data
We decided to use Hadoop for the prototype because it
  implements the MapReduce programming paradigm
  is widely used in industry products for Big Data analysis
  scales: if the prototype becomes widely used, we can grow the Hadoop cluster
Apache Hadoop
Some information about the processing engine we have chosen
Open Source Framework
implements Map Reduce
based on Java
provides HDFS, a parallel filesystem that
  divides files into chunks and distributes them over cluster nodes
  implements replication
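The two HDFS properties named above, chunking and replication, can be illustrated with a toy model. This sketch is not how HDFS is implemented: the round-robin placement is a simplifying assumption (the real placement policy is more involved), and the tiny chunk size and node names are made up for the example.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Cut a file's content into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, nodes: list, replication: int) -> dict:
    """Assign every chunk to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [
            nodes[(chunk_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


chunks = split_into_chunks(b"x" * 10, chunk_size=4)
placement = place_replicas(len(chunks), ["node1", "node2", "node3"], replication=2)
```

Because every chunk lives on two different nodes, losing any single node leaves at least one copy of each chunk available.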
How This Works Together
iRODS
rule-engine, triggering MapReduce jobs after file ingestion
storage-driver, moving incoming files to HDFS
Hadoop
execution of Map Reduce jobs
storing files for processing on HDFS
Architecture
Technical Aspects
HDFS in iRODS:
part of a compound resource
files ingested into iRODS are uploaded to HDFS
currently using univMSSInterface.sh
“Job” management:
acPostProcForPut reacts to the ingestion of */proc/*-like files
a delayed rule, which submits the Pig script and makes the results available, is
started with msiExecCmd
Script management:
one common iRODS collection with scripts
common parameter handling (at least input and output must be defined)
Apache Pig
Apache Pig is a platform that creates Hadoop jobs, based on user-defined
SQL-like queries.
    data = LOAD 'path/*' USING TextLoader();                     -- one line of text per record
    token = FOREACH data GENERATE FLATTEN(TOKENIZE($0)) AS word; -- one word per record
    words = FILTER token BY word MATCHES '\\w+';                 -- drop non-word tokens
    gr = GROUP words BY word;
    c = FOREACH gr GENERATE COUNT(words) AS cnt, group;          -- 'group' holds the grouped word
    res = ORDER c BY cnt;
    STORE res INTO 'path/output.dat';
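To make the individual Pig steps easier to follow, here is roughly the same word count in plain Python, running on a single machine. It is an approximation under stated assumptions: tokens are split on whitespace only (Pig's TOKENIZE also splits on a few other delimiters), and the function name and sample input are hypothetical.

```python
import re
from collections import Counter


def word_frequencies(lines):
    """Count words the way the Pig script does, step by step."""
    # TOKENIZE + FLATTEN: split every line into individual tokens.
    tokens = [token for line in lines for token in line.split()]
    # FILTER ... MATCHES '\\w+': keep tokens made only of word characters.
    words = [token for token in tokens if re.fullmatch(r"\w+", token)]
    # GROUP + COUNT: one counter entry per distinct word.
    counts = Counter(words)
    # ORDER BY cnt: ascending by count (ties broken by word for stability).
    return sorted((cnt, word) for word, cnt in counts.items())


result = word_frequencies(["the cat and the hat", "the end."])
```

The token `end.` is dropped by the filter because of the trailing period, just as a token containing punctuation would fail the `MATCHES` test in the Pig script.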
Summary
Generic Workspace for Big Data Processing
  a working prototype has been implemented
  follows the idea of active storage with processing functionality
  instead of just storing data
  uses a declarative approach: the user only has to define the expected
  results
  provides a Workspace that users, but also applications and other
  services, can interact with
  power users can extend the service by uploading Pig scripts
This prototype is extensible
  other processing frameworks can be integrated
Another iRODS Example
    acPostProcForPut {
        ON($objPath like "*/proc/wordCount") {
            [...]
            msiSplitPath(*path, *proccollection, *jobname);
            msiSplitPath(*proccollection, *parent, *ignored);
            [...]
            *arg = "*parent *output *jobname *scriptCollection";
            msiExecCmd("runPigJob.sh", "*arg", "null", "null", "null", *OUT);
            [...]
        }
    }
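The two msiSplitPath calls and the argument string can be traced with a small Python sketch. The example paths are hypothetical, and since the rule elides how *path, *output, and *scriptCollection are set, they are simply passed in here as parameters.

```python
import posixpath
import shlex


def build_job_args(obj_path, output, script_collection):
    """Derive the runPigJob.sh argument string from the uploaded path."""
    # First split: '/.../proc' and the job name, e.g. 'wordCount'.
    proc_collection, job_name = posixpath.split(obj_path)
    # Second split: the collection that contains 'proc'; the tail is ignored.
    parent, _ignored = posixpath.split(proc_collection)
    return " ".join(
        shlex.quote(part)
        for part in (parent, output, job_name, script_collection)
    )


args = build_job_args(
    "/zone/home/alice/letters/proc/wordCount",  # hypothetical upload path
    "/zone/home/alice/letters/results",         # hypothetical *output
    "/zone/scripts",                            # hypothetical *scriptCollection
)
```

The job therefore runs on the collection that contains the `proc` sub-collection, and the name of the uploaded file selects which script to execute. Quoting each part with `shlex.quote` is an addition of this sketch, not something the original rule does.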
Pig job wordfreq results
    ...
    1775 the
    1040 of
    730 in
    677 and
    457 to
    343 was
    334 a
    331 und
    248 die
    223 he
    ...