3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers
MTAGS 2010
Improving Many-Task Computing in Scientific Workflows Using P2P Techniques
Jonas Dias, Eduardo Ogasawara, Daniel de Oliveira, Esther Pacitti, Marta Mattoso
COPPE, Federal University of Rio de Janeiro, Brazil
INRIA & LIRMM, Montpellier, France
-
MTAGS 2010 Introduction
• Scientific Experiments
• Petascale Computing
– Behavior of hundreds of thousands of processors
– Parallel execution failures
• Scientific Workflows
– Represent the chaining of activities of an experiment
– Scientific Workflow Management Systems (SWfMS)
11/15/2010
Improving Many-Task Computing in Scientific Workflows Using P2P Techniques
2
[Figure: Typical Scientific Workflow. Stages: Pre-processing, Execution Kernel, Post-processing]
-
MTAGS 2010 Experiment Execution
• The same workflow may run several times
– 5000 parameter combinations to try
– 3 workflow variations
– Total of 15000 instances to be executed
• Motivation to parallelize
– Obtain the results in a timely manner
– Clusters, Grids and Clouds
• Utility Computing model
– Deliver the answers while they are still needed
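The instance count on this slide follows from crossing every parameter combination with every workflow variation. A minimal Python sketch of that enumeration; the names `variations` and `parameter_combinations` are hypothetical placeholders, since the slides give only the counts and not the actual workflows or parameter values:

```python
from itertools import product

# Hypothetical stand-ins: only the counts (3 and 5000) come from the slide.
variations = ["variation_a", "variation_b", "variation_c"]
parameter_combinations = range(5000)

# One workflow instance per (variation, parameter combination) pair.
instances = list(product(variations, parameter_combinations))
print(len(instances))  # 3 * 5000 = 15000
```

Each tuple in `instances` would correspond to one independent run, which is what makes the workload a natural fit for many-task execution.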
-
MTAGS 2010 Difficulties in Workflow Parallelism
• MPI
– Complex and legacy codes
– Dynamic resource management
– A job's process may fail
• Compromises the whole execution
• Resubmitting relies on the scientist's manual control
– Not feasible for a huge number of tasks
• Grid Schedulers
– Submit many jobs simultaneously
– Waiting time in resource management queues
-
MTAGS 2010 MTC Workflow Parallelism
• Many-task computing (MTC)
– Improves parameter sweep and data parallelism
• HPC Cluster Systems
– Jobs are not easy to set up for submission
– Centralized control
– Compute nodes may fail
• Open Issues
– Best approaches to set up an experiment execution
– Load balancing
– Dynamic resource management
– Failure control
• What has failed and needs to be rescheduled?
-
MTAGS 2010 MTC, Workflows and Clusters
• The Heracles Approach
– Approach to execute workflow activities
• More transparent setup
• Load Balancing
• Quality of service
• Distributed Provenance Gathering
– Uses the P2P model
• To be implemented in a cluster scheduler
• Not a P2P infrastructure
-
MTAGS 2010 Heracles Overview
[Figure: Heracles overview. Labels: Scientific Workflow Management System; Workflow MTC Scheduler; Heracles; Cluster]
-
MTAGS 2010 Heracles Structure
[Figure: Heracles structure. Labels: SWfMS; Workflow Instances Wrapper; Workflow MTC Scheduler; Cluster Scheduling; Heracles; Task (×3); Task Execution Monitoring; Distributed Table; Executer; Overlay Handler; Heracles Process; Process; Resource Manager; Node and Process (×4); Cluster]
-
MTAGS 2010 P2P view
[Figure: Heracles virtual P2P network view. Labels: Resource Manager; Node and Process (×4); Cluster; Process (×4)]
-
MTAGS 2010 Heracles
-
MTAGS 2010 Transparency
• Set the deadline, not the number of nodes
• Heracles controls the number of nodes involved
– Partial efficiency of the execution
– Automatically updates the number of processors needed
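One way to picture deadline-driven sizing is to divide the throughput still required by the throughput observed per core. The sketch below is an illustration only: the formula, the function name `cores_needed`, and the sample numbers are assumptions, not the rule Heracles actually applies.

```python
import math


def cores_needed(remaining_tasks: int, hours_left: float,
                 tasks_per_core_hour: float) -> int:
    """Estimate the cores required to finish the remaining tasks by the deadline."""
    if hours_left <= 0 or tasks_per_core_hour <= 0:
        raise ValueError("hours_left and tasks_per_core_hour must be positive")
    required_rate = remaining_tasks / hours_left  # tasks per hour still needed
    return math.ceil(required_rate / tasks_per_core_hour)


# Hypothetical observation: 160 tasks per hour on 64 cores = 2.5 tasks per core-hour.
observed_rate = 160 / 64
print(cores_needed(remaining_tasks=2000, hours_left=10,
                   tasks_per_core_hour=observed_rate))
```

Re-running such an estimate periodically, with the latest observed rate, would let the number of allocated cores follow the progress of the execution rather than a fixed request made by the scientist.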
-
MTAGS 2010 Dynamic Scheduling example
[Figure: completed tasks per hour and processing cores over time (horizontal axis: hours, 0 to 20; vertical axis: 0 to 200). Annotation: 173 tasks per hour, 64 cores]
-
MTAGS 2010 Efficiency
[Figure: efficiency over time (horizontal axis: hours, 0 to 20; vertical axis: 0 to 1)]
-
MTAGS 2010 Load Balancing
• Clusters depend on head node control
• Tasks can have their autonomy
– Like P2P dynamic control
• Hierarchical organization
– Based on P2P hierarchical networks
– Group leaders
– Working nodes
-
MTAGS 2010 Quality of Service
• Job's process failure
– Hard to reschedule with traditional approaches
– Manual rescheduling is not feasible
– How should it be addressed in the provenance collection?
• The P2P model can help
– Autonomy of the nodes
– Unfinished or failed tasks can be rescheduled
– Provenance may register all execution attempts or only the last one
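The two provenance options in the last bullet can be contrasted with a toy store. This is a sketch under stated assumptions: the class `ProvenanceLog`, its fields, and the node and task names are hypothetical and do not reflect how Heracles stores provenance.

```python
from collections import defaultdict
from typing import Dict, List


class ProvenanceLog:
    """Toy provenance store keyed by task id (not the Heracles implementation)."""

    def __init__(self, keep_all_attempts: bool = True) -> None:
        self.keep_all_attempts = keep_all_attempts
        self._attempts: Dict[str, List[dict]] = defaultdict(list)

    def record(self, task_id: str, node: str, outcome: str) -> None:
        attempt = {"node": node, "outcome": outcome}
        if self.keep_all_attempts:
            self._attempts[task_id].append(attempt)  # keep the full history
        else:
            self._attempts[task_id] = [attempt]      # overwrite: last attempt only

    def attempts(self, task_id: str) -> List[dict]:
        return list(self._attempts.get(task_id, []))


full = ProvenanceLog(keep_all_attempts=True)
last_only = ProvenanceLog(keep_all_attempts=False)
for log in (full, last_only):
    log.record("task-7", node="node-1", outcome="failed")
    log.record("task-7", node="node-2", outcome="finished")

print(len(full.attempts("task-7")), len(last_only.attempts("task-7")))
```

Keeping every attempt preserves the failure history for later analysis, while keeping only the last attempt yields a smaller record that describes just the run that produced the result.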
-
MTAGS 2010 When to reschedule?
• Group leaders are responsible for the decision
– Distributed table data
• Status of the tasks in the distributed table
– Pending, running or finished
• Average execution time of a task
• To reschedule means to change the status of the task to pending
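The decision described above can be sketched as a scan over the task table that sets overdue running tasks back to pending. Several parts are assumptions made for illustration: a plain dictionary stands in for the distributed table, and the `slack` factor and the names `TaskEntry` and `reschedule_overdue` are hypothetical; the slides state only that task status and average execution time inform the decision.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskEntry:
    status: str                          # "pending", "running" or "finished"
    started_at: Optional[float] = None   # seconds; set only while running


def reschedule_overdue(table: Dict[str, TaskEntry], now: float,
                       avg_exec_time: float, slack: float = 3.0) -> List[str]:
    """Set overdue running tasks back to pending and return their ids.

    `slack` is a hypothetical tuning factor, not a value from the slides.
    """
    rescheduled = []
    for task_id, entry in table.items():
        if entry.status == "running" and entry.started_at is not None:
            if now - entry.started_at > slack * avg_exec_time:
                entry.status = "pending"  # rescheduling = status back to pending
                entry.started_at = None
                rescheduled.append(task_id)
    return rescheduled


table = {
    "t1": TaskEntry("finished"),
    "t2": TaskEntry("running", started_at=0.0),
    "t3": TaskEntry("running", started_at=90.0),
    "t4": TaskEntry("pending"),
}
print(reschedule_overdue(table, now=100.0, avg_exec_time=20.0))
```

Because rescheduling only flips a status, any idle worker that later looks for pending tasks can pick the task up again without a central resubmission step.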
-
MTAGS 2010 Case Study
• Analyze the impact of churn events on task execution on clusters
– Many workflow activities to be executed
– Activities are decomposed into tasks
• Tasks are subject to churn events
– Activities producing 512, 1024, 2048 and 4096 tasks
– Tasks are classified as small, medium and large
– Seven days simulated
– Calibrated using real experiment data
-
MTAGS 2010 Rescheduling Types
• Manual Rescheduling
– The scientist checks the activity status every twelve hours
– If a failure happens, all the tasks of the activity are rescheduled
• Automatic Rescheduling
– Only the task that has failed is rescheduled
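The difference between the two policies can be made concrete by counting how many tasks run again after a failure. This is a simplified illustration, not the simulator used in the case study: it ignores the twelve-hour detection delay, and the function name `reexecuted_tasks` and the single-failure scenario are hypothetical.

```python
def reexecuted_tasks(tasks_in_activity: int, failed_tasks: int, policy: str) -> int:
    """Number of tasks that must run again once failures are detected."""
    if failed_tasks == 0:
        return 0
    if policy == "manual":
        return tasks_in_activity   # the whole activity is resubmitted
    if policy == "automatic":
        return failed_tasks        # only the failed tasks run again
    raise ValueError(f"unknown policy: {policy}")


# Activity sizes from the case study, with one hypothetical failed task each.
for size in (512, 1024, 2048, 4096):
    print(size,
          reexecuted_tasks(size, 1, "manual"),
          reexecuted_tasks(size, 1, "automatic"))
```

Under these assumptions the cost of manual rescheduling grows with the size of the activity, while the cost of automatic rescheduling depends only on the number of failures.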
-
MTAGS 2010 Small Tasks
-
MTAGS 2010 Medium Tasks
-
MTAGS 2010 Big Tasks
-
MTAGS 2010 Conclusions
• Empowering the execution of scientific experiments
– Scientific workflow parallelization on huge clusters
– Many-task computing
– Process failures, poor load balancing, usability issues
• Heracles Approach
– Transparency, load balancing and quality of service
– Using the P2P model on clusters
• The case study showed the gains from automatic rescheduling
-
MTAGS 2010 Future Work
• Analyze the advantages that MTC schedulers can achieve when using the full Heracles approach
• Use Heracles on real experiments
– Implement it on real schedulers such as Hydra
• Evaluate other fault-tolerance mechanisms such as redundant executions
-
MTAGS 2010 Acknowledgements