Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows
Rafael Ferreira da Silva, Ph.D.
http://pegasus.isi.edu
Pegasus http://pegasus.isi.edu 2
Do we need seismic analysis?
USArray: A continental-scale Seismic Observatory
US Array TA (IRIS Service): 836 stations
394 stations have online data available
http://ds.iris.edu/ds/nodes/dmc/earthscope/usarray/_US-TA-operational/
The development of reliable risk assessment methods for these hazards requires real-time analysis of seismic data
So, how to efficiently process these data?
Experiment Timeline
Scientific Problem: Earth Science, Astronomy, Neuroinformatics, Bioinformatics, etc.
Analytical Solution: Models, Quality Control, Image Analysis, etc.
Computational Scripts: Shell scripts, Python, Matlab, etc.
Automation: Workflows, MapReduce, etc.
Distributed Computing: Clusters, HPC, Cloud, Grid, etc.
Monitoring and Debug: Fault-tolerance, Provenance, etc.
Scientific Result
What is involved in an experiment execution?
Automate
Recover
Debug
Why Scientific Workflows?
Automates complex, multi-stage processing pipelines
Enables parallel, distributed computations
Automatically executes data transfers
Reusable, aids reproducibility
Records how data was produced (provenance)
Handles failures to provide reliability
Keeps track of data and files
Taking a closer look into a workflow…
job
dependency: usually data dependencies
split
merge
pipeline
Command-line programs
DAG: directed acyclic graph
abstract workflow
executable workflow
storage constraints
optimizations
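The job, dependency, split, and merge elements above can be illustrated with a small DAG. This is a minimal sketch using Python's standard `graphlib` module; the job names are hypothetical, and it does not show how Pegasus represents workflows internally.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps a job to the set of jobs it depends on.
dependencies = {
    "split": set(),
    "process_a": {"split"},
    "process_b": {"split"},
    "merge": {"process_a", "process_b"},
}


def execution_order(deps):
    """Return one valid order in which the jobs of the DAG can run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two `process_*` jobs have no dependency on each other, so a workflow engine is free to run them in parallel once `split` has finished.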
From the abstraction to execution!
stage-in job: transfers the workflow input data
stage-out job: transfers the workflow output data
registration job: registers the workflow output data
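The step from abstract to executable workflow can be sketched as wrapping the compute jobs with data-management jobs. The function below is a simplified illustration under stated assumptions (jobs are plain dictionaries, and the names `plan`, `stage_in`, `stage_out`, and `register` are hypothetical); it is not the Pegasus planner.

```python
def plan(abstract_jobs):
    """Add data-management jobs around a list of compute jobs.

    Each compute job is a dict with 'name', 'inputs' and 'outputs' (file names).
    """
    produced = {f for job in abstract_jobs for f in job["outputs"]}
    consumed = {f for job in abstract_jobs for f in job["inputs"]}
    # Files consumed but never produced must come from outside the workflow.
    external_inputs = sorted(consumed - produced)
    # Files produced but never consumed are treated as final outputs.
    final_outputs = sorted(produced - consumed)

    executable = [{"name": "stage_in", "files": external_inputs}]
    executable += [{"name": j["name"], "files": j["outputs"]} for j in abstract_jobs]
    executable.append({"name": "stage_out", "files": final_outputs})
    executable.append({"name": "register", "files": final_outputs})
    return executable


abstract = [
    {"name": "first_job", "inputs": ["input.txt"], "outputs": ["tmp.txt"]},
    {"name": "simul_job", "inputs": ["tmp.txt"], "outputs": ["output.0.dat"]},
]
executable = plan(abstract)
print([job["name"] for job in executable])
```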
Optimizing storage usage…
cleanup job: removes unused data
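One way to decide where a cleanup job can go is to find, for each file, the last job that touches it. The sketch below assumes the jobs are already listed in execution order and that final outputs are passed in `keep`; the function name `cleanup_after` is hypothetical.

```python
def cleanup_after(ordered_jobs, keep=()):
    """Return {job name: [files removable once that job finishes]}."""
    last_user = {}
    for job in ordered_jobs:
        for name in job["inputs"] + job["outputs"]:
            last_user[name] = job["name"]  # later jobs overwrite earlier ones
    removable = {job["name"]: [] for job in ordered_jobs}
    for name, job_name in last_user.items():
        if name not in keep:
            removable[job_name].append(name)
    return removable


jobs = [
    {"name": "first_job", "inputs": ["input.txt"], "outputs": ["tmp.txt"]},
    {"name": "simul_job", "inputs": ["tmp.txt"], "outputs": ["output.0.dat"]},
]
cleanup_plan = cleanup_after(jobs, keep={"output.0.dat"})
print(cleanup_plan)
```

Here `tmp.txt` can be removed only after `simul_job`, its last consumer, has finished.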
Workflow systems provide tools to generate the abstract workflow
from Pegasus.DAX3 import ADAG, File, Job, Link

# Create the abstract workflow (DAX)
dax = ADAG("test_dax")

# First job: reads input.txt and writes tmp.txt
firstJob = Job(name="first_job")
firstInputFile = File("input.txt")
firstOutputFile = File("tmp.txt")
firstJob.addArguments("input=input.txt", "output=tmp.txt")
firstJob.uses(firstInputFile, link=Link.INPUT)
firstJob.uses(firstOutputFile, link=Link.OUTPUT)
dax.addJob(firstJob)

# Five simulation jobs, each reading tmp.txt and writing its own output file
for i in range(0, 5):
    simulJob = Job(id="%s" % (i + 1), name="simul_job")
    simulInputFile = File("tmp.txt")
    simulOutputFile = File("output.%d.dat" % i)
    simulJob.addArguments("parameter=%d" % i, "input=tmp.txt",
                          "output=%s" % simulOutputFile.name)
    simulJob.uses(simulInputFile, link=Link.INPUT)
    simulJob.uses(simulOutputFile, link=Link.OUTPUT)
    dax.addJob(simulJob)
    # Each simulation job depends on the first job
    dax.depends(parent=firstJob, child=simulJob)

# Write the abstract workflow to an XML file
with open("test.dax", "w") as fp:
    dax.writeXML(fp)
abstract workflow
Which Workflow Management System?
Pegasus, Nimrod
…and which model to use?
task-oriented
  files
  parallelism: no concurrency
  data transfers via files
  heterogeneous execution: tasks can run in heterogeneous resources
stream-based
  streams
  concurrency: tasks run concurrently
  data transfers via memory or message
  homogeneous execution: tasks should run in homogeneous resources
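The difference between the two models can be sketched with two toy stages, one doubling values and one adding one. In the task-oriented version the stages exchange data through a file; in the stream-based version items flow through memory one at a time. This is an illustrative sketch only: the function names are hypothetical, and a Python generator interleaves the stages rather than running them truly concurrently.

```python
import os
import tempfile


def task_oriented(values, workdir):
    """Two tasks that communicate through a file on disk."""
    path = os.path.join(workdir, "tmp.txt")
    # Task 1 runs to completion and writes all of its output.
    with open(path, "w") as fp:
        for v in values:
            fp.write(f"{v * 2}\n")
    # Task 2 starts only after task 1 has finished.
    with open(path) as fp:
        return [int(line) + 1 for line in fp]


def stream_based(values):
    """Two stages connected in memory; items flow one at a time."""
    doubled = (v * 2 for v in values)  # stage 1, evaluated lazily
    return [v + 1 for v in doubled]    # stage 2 pulls items from stage 1


with tempfile.TemporaryDirectory() as d:
    file_result = task_oriented([1, 2, 3], d)
stream_result = stream_based([1, 2, 3])
print(file_result, stream_result)
```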
What does Pegasus provide?
Automation: automates pipeline executions
Parallel, distributed computations: automatically executes data transfers
Heterogeneous resources: task-oriented model; application is seen as a black box
Debug: workflow execution and job performance metrics; set of debugging tools to unveil issues; real-time monitoring, graphs, provenance
Optimization: job clustering; data cleanup
Recovery: job failure detection; checkpoint files; job retry; rescue DAGs
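The recovery ideas above (failure detection, job retry, and a rescue list of work still to do) can be sketched for a simple chain of jobs. This is a simplified illustration, not the Pegasus implementation: `run_with_retries` is a hypothetical helper, and a real rescue DAG keeps the full graph structure rather than a flat list.

```python
def run_with_retries(jobs, max_retries=2):
    """Run jobs in order; return (completed names, names left for a rescue run).

    `jobs` is a list of (name, callable) pairs already in dependency order.
    """
    completed = []
    for index, (name, action) in enumerate(jobs):
        for _attempt in range(1 + max_retries):
            try:
                action()
                completed.append(name)
                break
            except Exception:
                continue  # job failure detected: try again
        else:
            # Retries exhausted: everything not yet done goes to the rescue list.
            return completed, [n for n, _ in jobs[index:]]
    return completed, []


attempts = {"count": 0}


def flaky():
    """Fails on the first call, succeeds on the second."""
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient failure")


def broken():
    raise RuntimeError("permanent failure")


done, rescue = run_with_retries([("a", flaky), ("b", broken), ("c", lambda: None)])
print(done, rescue)
```

A later run would only need to execute the jobs in `rescue`, since the completed ones are not repeated.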
…and dispel4py?
Automation: automates pipeline executions
Concurrent, distributed computations: stream-based model
Workflow Composition: Python library; grouping (all-to-all, all-to-one, one-to-all)
Optimization: multiple streams (in/out); avoids I/O (shared memory or message passing)
Mapping: sequential; multiprocessing (shared memory); distributed memory, message passing (MPI); distributed real-time (Apache Storm); Apache Spark (prototype)
Asterism greatly simplifies the effort required to develop data-intensive applications that run across multiple heterogeneous resources distributed in the wide area
Pegasus
sub-workflow
dispel4py workflow
ASTERISM
Where to run scientific workflows?
High Performance Computing
shared filesystem
submit host (e.g., user's laptop)
There are several possible configurations…
typically most HPC sites
Workflow Engine
Cloud Computing
object storage
submit host (e.g., user's laptop)
Highly scalable object storage
Typical cloud computing deployment (Amazon S3, Google Storage)
Workflow Engine
submit host (e.g., user's laptop)
local data management
Typical OSG sites (Open Science Grid)
Workflow Engine
Grid Computing
shared filesystem
submit host (e.g., user's laptop)
And yes… you can mix everything!
Compute site B
Compute site A
object storage
Workflow Engine
How do we use Asterism to automate seismic analysis?
WORKFLOW
Phase 1 (pre-process)
Phase 2 (cross-correlation)
Seismic Ambient Noise Cross-Correlation
Preprocesses and cross-correlates traces (sequences of measurements of acceleration in three dimensions) from multiple seismic stations (IRIS database)
Phase 1: data preparation using statistics for extracting information from the noise
Phase 2: compute correlation, identifying the time for signals to travel between stations, which is used to infer properties of the intervening rock
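The core idea of Phase 2 can be illustrated with a toy cross-correlation that finds the delay between two traces. This is a minimal pure-Python sketch under stated assumptions: the traces and the 20 Hz sampling rate are made-up values, and real ambient-noise processing works on long preprocessed recordings rather than a single clean pulse.

```python
def cross_correlation_lag(a, b, max_lag):
    """Return the lag (in samples) at which trace b best matches trace a.

    A positive lag means the signal appears later in b than in a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += value * b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical traces: the same pulse, arriving 3 samples later at station B.
station_a = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
station_b = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]
sampling_rate_hz = 20.0  # assumed value, for illustration only

lag = cross_correlation_lag(station_a, station_b, max_lag=5)
travel_time_s = lag / sampling_rate_hz
print(lag, travel_time_s)
```

Dividing the lag by the sampling rate converts it into a travel time between the two stations.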
Apache Storm: distributed computation framework for event stream processing
Designed for massive scalability, supports fault-tolerance with a “fail fast, auto restart” approach to processes
Rich array of available spouts specialized for receiving data from all types of sources
Hadoop of real-time processing, very scalable
spout
bolt
Seismic workflow execution
Pegasus
IRIS database (stations)
data transfers between sites performed by Pegasus
Compute site B (Apache Storm)
Compute site A (MPI-based)
input data (~150MB)
submit host (e.g., user's laptop)
output data (~40GB)
Phase 1
Phase 2
WORKFLOW
Southern California Earthquake Center’s CyberShake
Builders ask seismologists: What will the peak ground motion be at my new building in the next 50 years?
Seismologists answer this question using Probabilistic Seismic Hazard Analysis (PSHA)
286 sites, 4 models; each workflow has 420,000 tasks
A few more workflow features…
Performance, why not improve it?
clustered job: groups small jobs together to improve performance
task: small granularity
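Job clustering can be sketched as grouping many short tasks into fewer, larger jobs. The helper below is a minimal illustration that assumes the tasks are independent of each other (for example, they sit at the same level of the workflow); `cluster_jobs` and the task names are hypothetical.

```python
def cluster_jobs(tasks, cluster_size):
    """Group consecutive independent tasks into clustered jobs."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return [tasks[i:i + cluster_size] for i in range(0, len(tasks), cluster_size)]


tasks = [f"simul_job_{i}" for i in range(5)]
clustered = cluster_jobs(tasks, cluster_size=2)
print(clustered)
```

Five tasks become three scheduled jobs, so the per-job scheduling overhead is paid three times instead of five.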
workflow restructuring
workflow reduction
pegasus-mpi-cluster
hierarchical workflows
What about data reuse?
data already available
Jobs whose output data is already available are pruned from the DAG
data reuse
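Data reuse can be sketched as pruning jobs whose outputs already exist. The function below walks the jobs from the end of the workflow backwards, so that a parent whose outputs are no longer needed by any remaining job is pruned as well. That cascading behaviour, the dictionary layout, and the name `reduce_workflow` are assumptions of this sketch, not a description of the Pegasus planner.

```python
def reduce_workflow(jobs, available):
    """Return the names of the jobs that still need to run.

    `jobs` is a list of dicts ('name', 'inputs', 'outputs') in dependency order;
    `available` is the set of files that already exist.
    """
    produced = {f for job in jobs for f in job["outputs"]}
    consumed = {f for job in jobs for f in job["inputs"]}
    needed = produced - consumed  # final products of the workflow
    kept = []
    for job in reversed(jobs):  # visit children before parents
        missing = [f for f in job["outputs"] if f in needed and f not in available]
        if missing:
            kept.append(job["name"])
            needed.update(job["inputs"])
    return list(reversed(kept))


jobs = [
    {"name": "first_job", "inputs": ["input.txt"], "outputs": ["tmp.txt"]},
    {"name": "simul_0", "inputs": ["tmp.txt"], "outputs": ["output.0.dat"]},
    {"name": "simul_1", "inputs": ["tmp.txt"], "outputs": ["output.1.dat"]},
]
partial = reduce_workflow(jobs, available={"input.txt", "output.0.dat"})
nothing = reduce_workflow(jobs, available={"input.txt", "output.0.dat", "output.1.dat"})
print(partial, nothing)
```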
workflow reduction
data also available
Handling large-scale workflows
recursion ends when a workflow with only compute jobs is encountered
sub-workflow
sub-workflow
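The recursion over sub-workflows can be sketched with nested lists, where a string stands for a compute job and a nested list stands for a sub-workflow. This representation and the function name `expand` are hypothetical, chosen only to illustrate where the recursion stops.

```python
def expand(workflow):
    """Recursively expand sub-workflows into a flat list of compute jobs."""
    compute_jobs = []
    for item in workflow:
        if isinstance(item, list):
            compute_jobs.extend(expand(item))  # descend into the sub-workflow
        else:
            compute_jobs.append(item)          # a plain compute job
    return compute_jobs


hierarchical = ["a", ["b", "c", ["d"]], "e"]
flat = expand(hierarchical)
print(flat)
```

A call that receives a list containing only compute jobs makes no further recursive calls, which is the stopping condition described above.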
Running fine-grained workflows on HPC systems…
pegasus-mpi-cluster
HPC System
submit host (e.g., user's laptop)
workflow wrapped as an MPI job: allows sub-graphs of a Pegasus workflow to be submitted as monolithic jobs to remote resources
Pegasus: Automate, recover, and debug scientific computations.
Get Started
Pegasus Website: http://pegasus.isi.edu
HipChat
Thank You
Questions?
Rafael Ferreira da Silva, [email protected]
Karan Vahi
Rafael Ferreira da Silva
Rajiv Mayani
Mats Rynge
Ewa Deelman