High Performance Computing for Science and Engineering II
22.03.2021 - Lecture: High Throughput Computing
Lecturer: Dr. Sergio Martin
A Recap of Last Week
- Introduction to UQ and Optimization Software: Overview of Korali's features and installation tutorial
- Optimization with Korali: Used CMA-ES to optimize a simple function
- Probability Sampling: Used MCMC to sample the shape of an unnormalized distribution
- Bayesian Inference: Used TMCMC to infer the best-fitting parameters of a model given reference data
Please download the rest of the practices (5-11) from the website
Today's Lecture
Sample Distribution Strategies
- Load Imbalance
- Divide & Conquer vs. Producer/Consumer
Korali Tutorial (Part II)
- Concurrent & Distributed Parallelism
- Fault Tolerance & Multi-Experiment Support
MPI and Sample Distribution
- One-Sided Communication
- Example (Genome Assembly)
Sample Distribution Strategies
Sample Distribution
[Diagram: Samples 0-7 waiting to be assigned to Node 0, Cores 0-3]
How do we distribute samples to cores?
Divide-And-Conquer Strategy
Regular communication:
- Happens at the beginning of each generation
- Message sizes are well-known
- Can use separate messages or a broadcast
Only applicable when the entire workload is known from the beginning
[Diagram: Samples 0-7 assigned two per core across Node 0, Cores 0-3]
Distribute samples equally (in number) among cores at the start of every generation
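A minimal sketch of this strategy using mpi4py (the library choice is an assumption; the lecture does not prescribe one): rank 0 splits the generation's samples into equal chunks and scatters them, every rank evaluates its chunk, and the results are gathered back.

from mpi4py import MPI

def evaluate(sample):
    return sum(x * x for x in sample)       # placeholder for the real model evaluation

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    samples = [[float(i), float(i + 1)] for i in range(8)]   # Samples 0-7 of this generation
    chunks = [samples[i::size] for i in range(size)]         # equal split (in number)
else:
    chunks = None

my_samples = comm.scatter(chunks, root=0)        # one well-known communication per generation
my_results = [evaluate(s) for s in my_samples]
all_results = comm.gather(my_results, root=0)    # collected before the next generation starts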
Load Imbalance
[Timeline: Parallel Sampler - Single-Core Model; the four cores of Node 0 finish at different times, so three of them sit idle]
Total Running Time = Max(Core Time)
Load Imbalance Ratio = (Max(Core Time) - Average(Core Time)) / Max(Core Time)
Happens when cores receive uneven workloads. Represents a waste of computational power.
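As a concrete, made-up example of the ratio above: if the four cores take 10 s, 4 s, 3 s, and 3 s on their assigned samples, the generation lasts 10 s while the average is 5 s, so half of the available core time is wasted. The same computation in a few lines of Python:

core_times = [10.0, 4.0, 3.0, 3.0]                    # hypothetical per-core times for one generation

total_running_time = max(core_times)                  # everyone waits for the slowest core
average_time = sum(core_times) / len(core_times)      # 5.0
imbalance_ratio = (total_running_time - average_time) / total_running_time

print(f"Total running time: {total_running_time:.1f} s")   # 10.0 s
print(f"Load imbalance ratio: {imbalance_ratio:.0%}")       # 50% of the capacity sits idle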
Producer/Consumer Model
[Diagram: on Node 0, Core 0 acts as the Producer while Cores 1-3 act as Consumers receiving Samples 0-7 one at a time]
Assign workload opportunistically as cores become available
Asynchronous Behavior:
- Producer sends samples to workers as soon as they become available
- Workers report back the finished sample and its result
- Producer keeps a queue of available workers
Does not require knowing the entire workload in advance
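A minimal producer/consumer sketch with mpi4py (again, the library choice is an assumption): rank 0 hands out one sample at a time to whichever worker reports back, so faster workers simply ask for more work and no core waits on a fixed share.

from mpi4py import MPI

def evaluate(sample):
    return sample * sample                  # placeholder for the real model evaluation

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
TASK, STOP = 1, 2                           # message tags

if rank == 0:                               # producer (Core 0)
    samples = list(range(8))                # Samples 0-7
    results, pending = {}, 0
    status = MPI.Status()
    for worker in range(1, size):           # seed each worker with one sample
        if samples:
            comm.send(samples.pop(0), dest=worker, tag=TASK)
            pending += 1
        else:
            comm.send(None, dest=worker, tag=STOP)
    while pending > 0:                      # opportunistically refill whoever finishes
        sid, value = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results[sid] = value
        pending -= 1
        worker = status.Get_source()
        if samples:
            comm.send(samples.pop(0), dest=worker, tag=TASK)
            pending += 1
        else:
            comm.send(None, dest=worker, tag=STOP)
else:                                       # consumer (Cores 1-3)
    status = MPI.Status()
    while True:
        sample = comm.recv(source=0, status=status)
        if status.Get_tag() == STOP:
            break
        comm.send((sample, evaluate(sample)), dest=0)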
Load Imbalance
[Timeline: Parallel Sampler - Single-Core Model; on Node 0 the consumer cores stay busy while the producer core only coordinates]
Total Running Time ≈ Mean(Core Time) as sample size and cores → ∞
Lost Performance = Producer Cores / Total Cores
Pop Quiz: Why do we need to sacrifice one core to act as the producer?
Pop Quiz: What's the impact on large multi-core systems (Euler = 24 cores)?
Generation-Based Methods
All samples for the next generation are known at the end of the previous generation (e.g., CMA-ES, TMCMC).
In contrast, chain-based methods (e.g., MCMC) determine the samples of the current generation in real time, based on the evaluation of previous chain steps.
Let's Discuss
[Diagrams: the producer/consumer layout (Core 0 as Producer, Cores 1-3 as Consumers) and the divide-and-conquer layout from the previous slides]
Q1: Is the Divide-and-Conquer strategy good for CMA-ES? What about TMCMC?
Q2: Is the Producer/Consumer strategy good for CMA-ES? And for TMCMC?
High Throughput Computing with Korali
Study Case Heating Plate
Study Case: Heating Plate
Given:
- A square metal plate with 3 sources of heat underneath it
- ~10 temperature measurements at different locations
Can we infer the (x,y) locations of the 3 heat sources?
Study Case Configuration
Experiment
- Problem: Bayesian Inference
- Model: C++ 2D Heat Equation
- Solver: TMCMC
- Run
[Diagram: Heat Sources 1-3, each with X, Y coordinates, feeding the likelihood and prior probability distributions]
Parameter Space: Heat Source 1 (x,y), Heat Source 2 (x,y), Heat Source 3 (x,y), Sigma (std. dev. of the likelihood)
Objective Function: Likelihood given the Reference Data
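For orientation, a sketch of how such an experiment could be configured in Korali's Python interface, following the configuration style used in the conduit snippets later in this lecture; the exact key names, prior bounds, and the heat-equation wrapper are assumptions to be checked against the practice6 code:

import korali

e = korali.Experiment()
e["Problem"]["Type"] = "Bayesian/Reference"
e["Problem"]["Likelihood Model"] = "Normal"
e["Problem"]["Reference Data"] = temperatureMeasurements   # the ~10 measured temperatures (assumed variable)
e["Problem"]["Computational Model"] = heat2DModel          # assumed wrapper around the C++ 2D heat equation

# one uniform prior, reused by all parameters (bounds are assumptions)
e["Distributions"][0]["Name"] = "Uniform 0"
e["Distributions"][0]["Type"] = "Univariate/Uniform"
e["Distributions"][0]["Minimum"] = 0.0
e["Distributions"][0]["Maximum"] = 1.0

# 3 heat sources x (x,y) + sigma = 7 parameters
names = ["Source1 X", "Source1 Y", "Source2 X", "Source2 Y", "Source3 X", "Source3 Y", "Sigma"]
for i, name in enumerate(names):
    e["Variables"][i]["Name"] = name
    e["Variables"][i]["Prior Distribution"] = "Uniform 0"

e["Solver"]["Type"] = "Sampler/TMCMC"
e["Solver"]["Population Size"] = 1000

k = korali.Engine()
k.run(e)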
Practice 6: Running the Study Case
Step I: Go to the practice6 folder and analyze its contents
Step II: Fill in the missing prior information based on the diagram below
Step III: Compile and run experiment "practice6"
Step IV: Gather information about the possible heat source locations
Step V: Plot the posterior distributions
Parallel Execution
Heterogeneous Model Support
Korali exposes multiple "Conduits": ways to run computational models
+ Sequential (default): Good for simple function-based Python/C++ models
+ Concurrent: For legacy code or pre-compiled applications (e.g., LAMMPS, Matlab, Fortran)
+ Distributed: For MPI/UPC++ distributed models (e.g., Mirheo)
Sequential Conduit: Links to the model code and runs the model sequentially via a function call
import korali

e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Sequential"
k.run(e)
Korali Application
def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    # ... computation ...
    sample["Evaluation"] = result
Computational Model
$ ./myKoraliApp.py
Running Application
Concurrent Conduit: Korali creates multiple concurrent workers to process samples in parallel
e = korali.Experiment()
k = korali.Engine()
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)
Korali Application
$ ./myKoraliApp.py
Running Application
[Diagram: the Korali main process forks Workers 0-3, distributes the samples among them, and joins their results]
Practice 7: Parallelize the Study Case
Step I: Go to folder practice7 and use the concurrent conduit to parallelize the code from practice 6
Step II: Analyze running times by trying different levels of parallelism
Step III: Use the top command to observe CPU usage while you run the example
Distributed Execution
Distributed Conduit: Can be used to run applications beyond the limits of a single node (needs MPI)
e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Distributed"
k.run(e)
Korali Application
def myModel(sample, comm):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    # ... local computation ...
    sample["Evaluation"] = result
Computational Model
$ mpirun -n 17 ./myKoraliApp.py
Running Application
[Diagram: Ranks 0-15 spread across Nodes 1-4 evaluate the samples, while the remaining rank acts as the Korali Engine Rank]
Distributed Conduit: Links to and runs distributed MPI applications through sub-communicator teams
e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myMPIModel
k["Conduit"]["Type"] = "Distributed"
k["Conduit"]["Ranks Per Sample"] = 4
k.run(e)
Korali Application
def myModel(sample, comm):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    myRank = comm.Get_rank()
    rankCount = comm.Get_size()
    # ... distributed computation ...
    sample["Evaluation"] = result
Computational Model
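A sketch of what the body of such a distributed model might look like; the row decomposition, the mpi4py-style calls, and the solveHeatRows() helper are assumptions. Each 4-rank team shares one sample, splits the solver's grid among its ranks, and reduces the partial results over the team's sub-communicator:

from mpi4py import MPI

def myMPIModel(sample, comm):
    # sample parameters, identical on every rank of the team
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]

    myRank = comm.Get_rank()        # rank within the team's sub-communicator, not COMM_WORLD
    rankCount = comm.Get_size()     # 4, as configured via "Ranks Per Sample"

    # assumed decomposition: split the solver's grid rows among the team
    totalRows = 256
    rowsPerRank = totalRows // rankCount
    firstRow = myRank * rowsPerRank
    lastRow = firstRow + rowsPerRank

    partial = solveHeatRows(x, y, firstRow, lastRow)    # hypothetical local solver call

    # combine the team's partial results; every rank gets the full evaluation
    sample["Evaluation"] = comm.allreduce(partial, op=MPI.SUM)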
$ mpirun -n 17 ./myKoraliApp.py
Running Application
[Diagram: Ranks 0-15 grouped into Subcomm 0-3 (four ranks per sample); the remaining rank acts as the Korali Engine Rank]
Korali's Scalable Sampler
[Diagram: the engine starts the experiment and distributes samples; worker teams cycle through Idle, Busy, and Done; the engine then saves results, checks for termination, and runs the next generation]
Practice 8: MPI-Based Distributed Models
Step 0: Get/install any MPI library (OpenMPI is open-source)
Step I: Use the distributed conduit to parallelize practice7
Step II: Go to folder practice8 and have Korali run the MPI-based model there
Step III: Fix the number of MPI ranks (to e.g. 8) and analyze execution times for different levels of 1) sampling parallelism and 2) model parallelism
Step IV: Configure Korali to store profiling information and use the profiler tool to see the evolution of the samples (Using Korali > Tools > Korali Profiler)
Running Out-of-the-box applications
import os

def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    os.system("./myApp " + str(x) + " " + str(y))   # launch the external application
    result = parseFile("ResultFile.out")            # gather its output
    sample["F(x)"] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box Applications: Many applications are closed-source or too complicated to interface with directly
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)
Korali Application
$ ./myKoraliApp.py
Running Application
[Diagram: myModel launches ./myApp with x and y; myApp writes its result to ResultFile.out, which parseFile(ResultFile.out) reads back]
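parseFile() is not defined on the slide; a minimal sketch of what it might do, assuming the external application writes a single scalar to its result file:

def parseFile(path):
    # read back the single scalar result the external application wrote
    with open(path, "r") as f:
        return float(f.read().strip())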
Practice 9: Running Out-of-the-Box Applications
Step I: Go to folder practice9 and examine the model application (what are its inputs and outputs?)
Step II: Modify the Korali application's objective model to run the application, specifying its inputs and gathering its output
Step III: Run the application with different levels of concurrency
Running Multiple Experiments
Scheduling Multiple Experiments
[Diagram: multiple experiments are started together and their samples share the same pool of workers, which cycle through Idle, Busy, and Done]
Effect of Simultaneous Execution
- Running Experiments Sequentially: Average Efficiency 73.9%
- Running Experiments Simultaneously: Average Efficiency 97.8%
Practice 10: Running Multiple Experiments
Step I: Go to folder practice10 and examine the Korali application
Step II: Run the application in parallel and use the profiler tool to see how the experiments executed
Step III: Change the Korali application to run all experiments simultaneously
Step IV: Run and profile the application again and compare the results with those of Step II
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
[Timeline: Slurm Job 1 (4000 nodes) runs Experiments 0 and 1 under the Korali Engine through Generations 0-3 until a fatal failure; Slurm Job 2 (4000 nodes) resumes both experiments, runs Generation 4, and reaches the final result]
Korali can resume any Solver / Problem / Conduit combination
Practice 11: Resuming Previous Experiments
Step I: Go to folder practice11 and examine the Korali application
Step II: Run the application to completion (10 generations), taking note of the final result
Step III: Delete the results folder and change the Korali application to run only the first 5 generations (with this we simulate that an error has occurred)
Step IV: Now change the application again to run the last 5 generations
Step V: Compare the results with those of an uninterrupted run
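A sketch of how Steps III and IV might look in the Korali application. The "Max Generations" termination criterion and the output path follow standard Korali configuration keys, while the loadState() resume call is an assumption to verify against the Korali documentation:

# Step III: stop early after 5 generations (simulating the crash)
e["Solver"]["Termination Criteria"]["Max Generations"] = 5
e["File Output"]["Path"] = "_korali_result"      # state is saved here every generation
k.run(e)

# Step IV (second run): reload the saved state and continue up to generation 10
e.loadState("_korali_result/latest")             # assumption: check the exact resume call in the Korali manual
e["Solver"]["Termination Criteria"]["Max Generations"] = 10
k.run(e)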
MPI and Sample Distribution: A Discussion
Two-sided Communication: A sender and a receiver process explicitly participate in the exchange of a message
[Diagram: MPI_Send() passes the message through an intermediate buffer to MPI_Recv()]
A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics: the receiver needs to know what to do with the data
MPI: De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication: A process can directly access a shared partition in another address space
MPI_Put() / MPI_Get()
One-Sided Communication
- Allows passing/receiving data without a corresponding send/recv request
- The other end is not notified of the operation (concurrency hazards)
- Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information: data
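A minimal one-sided sketch in mpi4py (the Python binding is an assumption; the calls mirror MPI_Put/MPI_Get): rank 0 writes a value directly into a window exposed by rank 1, and fences provide the only synchronization, with no matching receive on the target side.

from mpi4py import MPI
import numpy as np

# run with at least two ranks, e.g. mpirun -n 2 python this_script.py
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.zeros(1, dtype='i')                 # one integer exposed as a shared window
win = MPI.Win.Create(local, comm=comm)

win.Fence()                                    # open the access epoch
if rank == 0:
    value = np.array([42], dtype='i')
    win.Put(value, 1)                          # write directly into rank 1's memory
win.Fence()                                    # close the epoch; rank 1 never posted a receive

if rank == 1:
    print("Rank 1 window now holds:", local[0])
win.Free()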
A Good Case for MPI: Iterative Solvers
Traditional Decomposition: 1 Process (Rank) per Core
[Diagram: a node with Cores 0-3, one rank each]
Structured Grid Stencil Solver (2D Grid):
- Iteratively approaches a solution
- Ranks exchange halo (boundary) cells
Regular Communication
[Core-usage timeline, conventional decomposition (1 rank per core): each rank alternates between computation and network communication phases over time]
Most HPC applications are programmed under the Bulk-Synchronous Model: they iterate between separate computation and communication phases
[Legend: useful computation, network communication cost, and intra-node data motion cost, shown across the computation and communication phases]
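A sketch of one bulk-synchronous iteration of such a stencil solver in mpi4py, assuming a simple 1D row decomposition: every rank updates its local rows in the computation phase, then all ranks exchange halo rows in a regular, well-known communication phase.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# local block of grid rows plus one halo row above and below
local = np.random.rand(10 + 2, 64)
up   = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for iteration in range(100):
    # computation phase: Jacobi-style update of the interior points
    local[1:-1, 1:-1] = 0.25 * (local[:-2, 1:-1] + local[2:, 1:-1] +
                                local[1:-1, :-2] + local[1:-1, 2:])

    # communication phase: exchange halo (boundary) rows with the neighbors
    comm.Sendrecv(local[1].copy(),  dest=up,   recvbuf=local[-1], source=down)
    comm.Sendrecv(local[-2].copy(), dest=down, recvbuf=local[0],  source=up)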
A NOT so Good Case for MPI: Genome Assembly
[Figure: original DNA is sequenced into short fragments and re-assembled]
- Construct a genome (chromosome) from a pool of short fragments produced by sequencers
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces
- Complications: shreds of other books may be intermixed; shreds can also contain errors
- Chop the reads into fixed-length fragments (k-mers)
- K-mers form a De Bruijn graph; traverse the graph to construct longer sequences
- The graph is stored in a distributed hash table
Image/Slide Credit: Scott B. Baden (Berkeley Lab)
A NOT so Good Case for MPI: Genome Assembly
Initial Segment of DNA: ACTCGATGCTCAATG
[Diagram: building and aligning the distributed hash table]
- Hash Table for Rank 1: GATG->ATGC, ACTC->CTCG->TCGA, TGTC->GCTC->CTCA->TCAA
- Hash Table for Rank 0: TGCT->GCTC, TCAA->CAAT->AATG
- Ranks 0 and 1 detect new edges, update their hash tables, and detect coinciding hashes
- Build k-mer graphs from independent segments sharing their hash numbers
- After aligning the k-mers: Rank 1 holds GATG->ATGC, ACTC->CTCG->TCGA, TGTC->GCTC->CTCA->TCAA->CAAT->AATG; Rank 0 holds TGCT->GCTC
Completely Asynchronous:
- Detection of coincident hashes
- Asynchronous hash updates
Irregular Communication:
- K-mer chain size can vary
- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement with two-sided MPI due to this asynchronicity
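A toy sketch of the data structure involved (the k-mer length and the hash-based partitioning are assumptions): each k-mer edge is assigned to an owner rank by hashing, so updates land on arbitrary ranks at unpredictable times, which is exactly the irregular, asynchronous pattern that is awkward for two-sided MPI.

K = 4  # k-mer length (assumption)

def kmers(read):
    """Chop a read into overlapping k-mers and the edges between consecutive ones."""
    for i in range(len(read) - K):
        yield read[i:i + K], read[i + 1:i + K + 1]   # (k-mer, next k-mer) edge

def owner(kmer, nranks):
    # stand-in for a deterministic hash: whoever owns this bucket must be updated,
    # no matter which rank discovered the edge
    return hash(kmer) % nranks

# toy example with the segment from the slide and 2 ranks
nranks = 2
segment = "ACTCGATGCTCAATG"
for km, nxt in kmers(segment):
    print(f"edge {km} -> {nxt}  goes to rank {owner(km, nranks)}")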
Let's Discuss
[Diagram: the divide-and-conquer layout with Samples 0-7 split evenly among Node 0, Cores 0-3]
Q1: Is MPI a good model for the divide-and-conquer strategy?
[Diagram: the producer/consumer layout with Core 0 as Producer and Cores 1-3 as Consumers]
Q2: Is MPI a good model for the Producer/Consumer strategy?
Asynchronous communication models might be better in these cases (e.g., UPC++)
A Recap of Last WeekIntroduction to UQ and Optimization Software - Overview of Koralis features and installation tutorial
2
Optimization with Korali - Used CMAES to optimize a simple functionProbability Sampling - Used MCMC to sample the shape of a unnormalized distributionBayesian Inference - Used TMCMC to infer the best fitting parameters of a model giving reference data
Please download the rest of the practices (5-11) from the website
Todays Lecture
Sample Distribution Strategies- Load Imbalance - Divide amp Conquer vs ProducerConsumer
Korali Tutorial (Part II)- Concurrent Distributed Parallelism - Fault Tolerance Multi-Experiment Support
MPI and Sample Distribution- One-Sided Communication - Example (Genome Assembly)
Sample DistributionStrategies
Sample Distribution
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
How do we distribute samples to cores
Sample 0
Divide-And-Conquer Strategy
Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast
Only applicable when the entire workload is known from the beginning
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Distribute samples equally (in number) among cores at the start of every generation
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time = Max(Core Time)
Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)
Idle
Idle
Idle
Happens when cores receive uneven workloadsRepresents a waste of computational power
Sample 0
Sample 2
Sample 4
Sample 6
Sample 1
Sample 3
Sample 5
Sample 7
Producer Consumer Model
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Assign workload opportunistically as coreswork become available
Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers
Does not require the entireknowing the workload in advance
Producer
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite
Lost Performance = ProducerCoresTotalCores
Pop Quiz Why do we need to sacrifice one worker node
Sample 6
Sample 7
Sample 5
Sample 3
Sample 4Sample 0
Sample 1
Sample 2
Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)
Generation-Based Methods
All samples for the next generation are knownat the end of the previous generation
CMA-ES TMCMC
Samples for the current generationare determined in real-time based on the
evaluation of previous chain steps
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Todays Lecture
Sample Distribution Strategies- Load Imbalance - Divide amp Conquer vs ProducerConsumer
Korali Tutorial (Part II)- Concurrent Distributed Parallelism - Fault Tolerance Multi-Experiment Support
MPI and Sample Distribution- One-Sided Communication - Example (Genome Assembly)
Sample DistributionStrategies
Sample Distribution
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
How do we distribute samples to cores
Sample 0
Divide-And-Conquer Strategy
Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast
Only applicable when the entire workload is known from the beginning
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Distribute samples equally (in number) among cores at the start of every generation
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time = Max(Core Time)
Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)
Idle
Idle
Idle
Happens when cores receive uneven workloadsRepresents a waste of computational power
Sample 0
Sample 2
Sample 4
Sample 6
Sample 1
Sample 3
Sample 5
Sample 7
Producer Consumer Model
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Assign workload opportunistically as coreswork become available
Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers
Does not require the entireknowing the workload in advance
Producer
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite
Lost Performance = ProducerCoresTotalCores
Pop Quiz Why do we need to sacrifice one worker node
Sample 6
Sample 7
Sample 5
Sample 3
Sample 4Sample 0
Sample 1
Sample 2
Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)
Generation-Based Methods
All samples for the next generation are knownat the end of the previous generation
CMA-ES TMCMC
Samples for the current generationare determined in real-time based on the
evaluation of previous chain steps
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Sample DistributionStrategies
Sample Distribution
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
How do we distribute samples to cores
Sample 0
Divide-And-Conquer Strategy
Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast
Only applicable when the entire workload is known from the beginning
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Distribute samples equally (in number) among cores at the start of every generation
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time = Max(Core Time)
Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)
Idle
Idle
Idle
Happens when cores receive uneven workloadsRepresents a waste of computational power
Sample 0
Sample 2
Sample 4
Sample 6
Sample 1
Sample 3
Sample 5
Sample 7
Producer Consumer Model
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Assign workload opportunistically as coreswork become available
Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers
Does not require the entireknowing the workload in advance
Producer
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite
Lost Performance = ProducerCoresTotalCores
Pop Quiz Why do we need to sacrifice one worker node
Sample 6
Sample 7
Sample 5
Sample 3
Sample 4Sample 0
Sample 1
Sample 2
Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)
Generation-Based Methods
All samples for the next generation are knownat the end of the previous generation
CMA-ES TMCMC
Samples for the current generationare determined in real-time based on the
evaluation of previous chain steps
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Sample Distribution
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
How do we distribute samples to cores
Sample 0
Divide-And-Conquer Strategy
Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast
Only applicable when the entire workload is known from the beginning
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Distribute samples equally (in number) among cores at the start of every generation
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time = Max(Core Time)
Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)
Idle
Idle
Idle
Happens when cores receive uneven workloadsRepresents a waste of computational power
Sample 0
Sample 2
Sample 4
Sample 6
Sample 1
Sample 3
Sample 5
Sample 7
Producer Consumer Model
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Assign workload opportunistically as coreswork become available
Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers
Does not require the entireknowing the workload in advance
Producer
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite
Lost Performance = ProducerCoresTotalCores
Pop Quiz Why do we need to sacrifice one worker node
Sample 6
Sample 7
Sample 5
Sample 3
Sample 4Sample 0
Sample 1
Sample 2
Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)
Generation-Based Methods
All samples for the next generation are knownat the end of the previous generation
CMA-ES TMCMC
Samples for the current generationare determined in real-time based on the
evaluation of previous chain steps
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
Korali exposes multiple "Conduits", i.e., ways to run computational models:
+ Sequential (default): Good for simple function-based Python/C++ models
+ Concurrent: For legacy code or pre-compiled applications (e.g., LAMMPS, MATLAB, Fortran)
+ Distributed: For MPI/UPC++ distributed models (e.g., Mirheo)
18
Sequential Conduit: Links to the model code and runs the model sequentially via a function call.
e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Sequential"
k.run(e)
Korali Application
def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    # ... computation ...
    sample["Evaluation"] = result
Computational Model
$ myKoraliApp.py
Running Application
19
Concurrent Conduit: Korali creates multiple concurrent workers to process samples in parallel.
e = korali.Experiment()
k = korali.Engine()
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)
Korali Application
$ myKoraliApp.py
Running Application
The Korali main process forks Worker 0 to Worker 3, distributes the samples among them, and joins the workers once all samples have been evaluated.
20
Practice 7: Parallelize the Study Case
21
Step I: Go to folder practice7 and use the concurrent conduit to parallelize the code from practice 6.
Step II: Analyze running times with different levels of parallelism.
Step III: Use the top command to observe CPU usage while you run the example.
22
Distributed Execution
Distributed Conduit: Can be used to run applications beyond the limits of a single node (needs MPI).
e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Distributed"
k.run(e)
Korali Application
def myModel(sample, comm):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    # ... local computation ...
    sample["Evaluation"] = result
Computational Model
$ mpirun -n 17 myKoraliApp.py
Running Application
Ranks 0-15 act as workers, spread across Nodes 1-4; one additional rank serves as the Korali engine rank.
23
Distributed Conduit: Links to and runs distributed MPI applications through sub-communicator teams.
e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myMPIModel
k["Conduit"]["Type"] = "Distributed"
k["Conduit"]["Ranks Per Sample"] = 4
k.run(e)
Korali Application
def myMPIModel(sample, comm):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    myRank = comm.Get_rank()
    rankCount = comm.Get_size()
    # ... distributed computation over the sample's sub-communicator ...
    sample["Evaluation"] = result
Computational Model
$ mpirun -n 17 myKoraliApp.py
Running Application
Ranks 0-15 are grouped into four sub-communicator teams (Subcomm 0 to Subcomm 3, four ranks each); one additional rank serves as the Korali engine rank.
24
Korali's Scalable Sampler: workers start Idle; when the experiment starts, the engine distributes samples and the workers become Busy; as workers report Done, Korali saves the results and checks for termination; if the experiment is not finished, it runs the next generation and the workers cycle back from Idle.
25
Practice 8: MPI-Based Distributed Models
26
Step 0: Get/install any MPI library (OpenMPI is open-source).
Step I: Use the distributed conduit to parallelize practice7.
Step II: Go to folder practice8 and have Korali run the MPI-based model there.
Step III: Fix the number of MPI ranks (to, e.g., 8) and analyze execution times under different levels of 1) sampling parallelism and 2) model parallelism (a sketch of this trade-off follows these steps).
Step IV: Configure Korali to store profiling information and use the profiler tool to see the evolution of the samples (using Korali > Tools > Korali Profiler).
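As a rough guide for Step III, the sketch below shows the knob involved, assuming the 17-rank launch shown in the slides above (one Korali engine rank plus 16 workers); the specific values are illustrative only.

# $ mpirun -n 17 myKoraliApp.py      -> 1 Korali engine rank + 16 worker ranks
k["Conduit"]["Type"] = "Distributed"
k["Conduit"]["Ranks Per Sample"] = 2    # 8 teams of 2 ranks: more sampling parallelism
# k["Conduit"]["Ranks Per Sample"] = 4  # 4 teams of 4 ranks
# k["Conduit"]["Ranks Per Sample"] = 8  # 2 teams of 8 ranks: more model parallelism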
27
Running Out-of-the-box applications
import os

def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    os.system("./myApp " + str(x) + " " + str(y))    # launch the external application
    result = parseFile("ResultFile.out")             # read back its output
    sample["F(x)"] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box Applications: Many applications are closed-source or too complicated to interface with directly.
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)
Korali Application
$ myKoraliApp.py
Running Application
28
The model launches "myApp x y"; the application writes its Result to ResultFile.out, which the model reads back with parseFile("ResultFile.out").
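parseFile above is a user-supplied helper, not part of Korali; a minimal version, assuming the external application writes a single number to ResultFile.out, could be:

def parseFile(path):
    # Read back the single objective value written by the external application.
    with open(path) as f:
        return float(f.read().strip())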
Practice 9: Running Out-of-the-Box Applications
Step I: Go to folder practice9 and examine the model application (what are its inputs and outputs?).
29
Step II: Modify the Korali application's objective model to run the application, specifying its inputs and gathering its output.
Step III: Run the application with different levels of concurrency.
30
Running Multiple Experiments
Scheduling Multiple Experiments
Start Experiments: samples from all running experiments share the same worker pool, so workers that would otherwise sit Idle while one experiment finishes (Done) stay Busy with samples from another.
31
Effect of Simultaneous Execution:
Running experiments sequentially: average efficiency 73.9%
Running experiments simultaneously: average efficiency 97.8%
32
Practice 10: Running Multiple Experiments
33
Step I: Go to folder practice10 and examine the Korali application.
Step II: Run the application in parallel and use the profiler tool to see how the experiments executed.
Step III: Change the Korali application to run all experiments simultaneously (see the sketch after these steps).
Step IV: Run and profile the application again and compare the results with those of Step II.
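One plausible way to express Step III, assuming the Korali engine accepts a list of experiments (as the multi-experiment support described above suggests); the loop bound and the "File Output" field name are assumptions for illustration.

import korali

k = korali.Engine()
experiments = []
for i in range(4):
    e = korali.Experiment()
    # ... configure problem, variables, and solver for experiment i ...
    e["File Output"]["Path"] = "results_experiment_%d" % i   # keep result folders separate
    experiments.append(e)

# Running all experiments at once lets their samples share the same worker pool
# instead of finishing one experiment before starting the next.
k.run(experiments)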
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Timeline (in hours): Slurm Job 1 (4000 nodes) runs Experiment 0 and Experiment 1 under the Korali Engine through Generations 0-3 until a fatal failure occurs; Slurm Job 2 (4000 nodes) restarts the Korali Engine, which resumes both experiments from the last saved generation and carries them through Generation 4 to the Final result.
Korali can resume any Solver / Problem / Conduit combination.
35
Practice 11: Resuming Previous Experiments
36
Step I: Go to folder practice11 and examine the Korali application.
Step II: Run the application to completion (10 generations), taking note of the final result.
Step III: Delete the results folder and change the Korali application to run only the first 5 generations (with this we simulate that an error has occurred).
Step IV: Now change the application again to run the last 5 generations (a sketch of this resume workflow follows these steps).
Step V: Compare the results with those of an uninterrupted run.
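The sketch below outlines the resume workflow of Steps III and IV, assuming (as the fault-tolerance slide above suggests) that Korali reloads the state it finds in the experiment's results folder when the same application is run again; the "File Output", "Termination Criteria", and "Max Generations" field names are assumptions and may differ from the practice11 sources.

import korali

e = korali.Experiment()
k = korali.Engine()
# ... problem, variables, and solver configured as in practice11 ...
e["File Output"]["Path"] = "_korali_result"   # state is saved here every generation

# First (interrupted) run: stop after 5 of the 10 generations.
e["Solver"]["Termination Criteria"]["Max Generations"] = 5
k.run(e)

# Second run (after editing the application): raise the limit and run again;
# the saved state lets the experiment continue from generation 5 to generation 10.
e["Solver"]["Termination Criteria"]["Max Generations"] = 10
k.run(e)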
MPI and Sample Distribution: A Discussion
Two-sided Communication: A sender and a receiver process explicitly participate in the exchange of a message.
MPI_Send() -> Message -> MPI_Recv()
Intermediate Buffer
A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics: the receiver needs to know what to do with the data.
MPI: The de facto communication standard for high-performance scientific applications.
A Review of MPI
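As a concrete illustration of the two-sided exchange just described, here is a minimal example written with mpi4py (the Python binding is an assumption for illustration; the lecture itself is not tied to it): the message carries data, and both ranks must reach their matching call for the exchange to complete.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Run with: mpirun -n 2 python two_sided.py
if rank == 0:
    payload = {"generation": 3, "samples": [0.1, 0.7]}   # illustrative data
    comm.send(payload, dest=1, tag=0)   # paired with the recv below: moves data and synchronizes
elif rank == 1:
    data = comm.recv(source=0, tag=0)   # blocks until the message arrives
    # MPI only moves the bytes; the receiver must know what the payload means.
    print("Rank 1 received", data)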
One-sided Communication: A process can directly access a shared partition in another process's address space.
MPI_Put() / MPI_Get()
One-Sided Communication
Allows passing/receiving data without a corresponding send/recv request.
The other end is not notified of the operation (concurrency hazards).
Good for cases in which synchronization ordering is not necessary.
It only encodes one piece of information: data.
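A minimal one-sided counterpart, again sketched with mpi4py: rank 0 writes directly into a memory window exposed by rank 1, which never posts a receive; the fences are the only synchronization. The two-rank layout and buffer size are assumptions for illustration.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank exposes one double through an RMA window. Run with: mpirun -n 2 python one_sided.py
buf = np.zeros(1, dtype='d')
win = MPI.Win.Create(buf, comm=comm)

win.Fence()                        # open the access epoch
if rank == 0:
    value = np.array([3.14], dtype='d')
    win.Put(value, 1)              # write into rank 1's buffer; rank 1 takes no action
win.Fence()                        # close the epoch: the Put is now visible on rank 1

if rank == 1:
    print("Rank 1 now holds", buf[0])
win.Free()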
A Good Case for MPI: Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
Most HPC applications are programmed under the Bulk-Synchronous Parallel model: they iterate between separate computation and communication phases.
Core usage timeline (conventional decomposition, 1 rank per core): each rank alternates between a Computation Phase of useful computation and a Communication Phase that pays the network communication cost plus the intra-node data motion cost.
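To make the pattern concrete, here is a small halo-exchange sketch in mpi4py (a 1D strip decomposition of a 2D grid; the grid size, iteration count, and Jacobi-style update are illustrative assumptions): every iteration alternates a regular, well-known communication phase with a purely local computation phase.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nrows, ncols = 64, 64                        # local strip size (illustrative)
grid = np.zeros((nrows + 2, ncols))          # two extra ghost rows hold the halo
up   = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # Communication phase: exchange boundary rows with both neighbors.
    comm.Sendrecv(sendbuf=grid[1, :],  dest=up,   recvbuf=grid[-1, :], source=down)
    comm.Sendrecv(sendbuf=grid[-2, :], dest=down, recvbuf=grid[0, :],  source=up)
    # Computation phase: 5-point stencil update on the interior cells.
    grid[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                               grid[1:-1, :-2] + grid[1:-1, 2:])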
A NOT so Good Case for MPI: Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencers.
Analogy: shred many copies of a book and reconstruct the book by examining the pieces.
Complications: shreds of other books may be intermixed and can also contain errors.
Chop the reads into fixed-length fragments (k-mers).
K-mers form a De Bruijn graph; traverse the graph to construct longer sequences.
The graph is stored in a distributed hash table.
Image and slide credit: Scott B. Baden (Berkeley Lab)
A NOT so Good Case for MPI: Genome Assembly
Initial Segment of DNA: ACTCGATGCTCAATG
GATG->ATGC  ACTC->CTCG->TCGA
TGTC->GCTC->CTCA->TCAA
Hash Table for Rank 1
TGCT->GCTC  TCAA->CAAT->AATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edge / Update Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG->ATGC  ACTC->CTCG->TCGA
TGTC->GCTC->CTCA->TCAA->CAAT->AATG
Hash Table for Rank 1
TGCT->GCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous: detection of coincident hashes; asynchronous hash updates.
Irregular Communication: the k-mer chain size can vary, and hash entries must be allocated in real time (they cannot be pre-allocated).
Difficult to implement in MPI because of this asynchronicity.
Let's Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1: Is MPI a good model for the divide-and-conquer strategy?
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2: Is MPI a good model for the Producer/Consumer strategy?
Asynchronous communication models might be better in these cases (e.g., UPC++).
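To see why, here is a minimal producer/consumer sketch in plain two-sided mpi4py (not Korali's implementation; the sample list and the toy evaluation are made up): it works, but the producer rank does nothing except match returned results to new sends, and every hand-off needs an explicit, correctly ordered send/recv pair.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

samples = [[0.1 * i, 0.2 * i] for i in range(32)]   # made-up parameter vectors
# Run with at least two ranks, e.g.: mpirun -n 4 python producer_consumer.py

if rank == 0:
    # Producer: hands out samples opportunistically, evaluates nothing itself.
    status = MPI.Status()
    pending = list(enumerate(samples))
    received = 0
    for worker in range(1, size):                    # prime every consumer once
        comm.send(pending.pop(0) if pending else None, dest=worker)
    while received < len(samples):
        idx, value = comm.recv(source=MPI.ANY_SOURCE, status=status)
        received += 1
        # Send the next sample (or a stop signal) straight back to the idle worker.
        comm.send(pending.pop(0) if pending else None, dest=status.Get_source())
else:
    # Consumer: evaluate whatever arrives until told to stop.
    while True:
        task = comm.recv(source=0)
        if task is None:
            break
        idx, params = task
        comm.send((idx, sum(p * p for p in params)), dest=0)   # toy objective value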
Divide-And-Conquer Strategy
Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast
Only applicable when the entire workload is known from the beginning
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Distribute samples equally (in number) among cores at the start of every generation
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time = Max(Core Time)
Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)
Idle
Idle
Idle
Happens when cores receive uneven workloadsRepresents a waste of computational power
Sample 0
Sample 2
Sample 4
Sample 6
Sample 1
Sample 3
Sample 5
Sample 7
Producer Consumer Model
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Assign workload opportunistically as coreswork become available
Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers
Does not require the entireknowing the workload in advance
Producer
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite
Lost Performance = ProducerCoresTotalCores
Pop Quiz Why do we need to sacrifice one worker node
Sample 6
Sample 7
Sample 5
Sample 3
Sample 4Sample 0
Sample 1
Sample 2
Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)
Generation-Based Methods
All samples for the next generation are knownat the end of the previous generation
CMA-ES TMCMC
Samples for the current generationare determined in real-time based on the
evaluation of previous chain steps
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time = Max(Core Time)
Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)
Idle
Idle
Idle
Happens when cores receive uneven workloadsRepresents a waste of computational power
Sample 0
Sample 2
Sample 4
Sample 6
Sample 1
Sample 3
Sample 5
Sample 7
Producer Consumer Model
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Assign workload opportunistically as coreswork become available
Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers
Does not require the entireknowing the workload in advance
Producer
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite
Lost Performance = ProducerCoresTotalCores
Pop Quiz Why do we need to sacrifice one worker node
Sample 6
Sample 7
Sample 5
Sample 3
Sample 4Sample 0
Sample 1
Sample 2
Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)
Generation-Based Methods
All samples for the next generation are knownat the end of the previous generation
CMA-ES TMCMC
Samples for the current generationare determined in real-time based on the
evaluation of previous chain steps
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Producer Consumer Model
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Assign workload opportunistically as coreswork become available
Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers
Does not require the entireknowing the workload in advance
Producer
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite
Lost Performance = ProducerCoresTotalCores
Pop Quiz Why do we need to sacrifice one worker node
Sample 6
Sample 7
Sample 5
Sample 3
Sample 4Sample 0
Sample 1
Sample 2
Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)
Generation-Based Methods
All samples for the next generation are knownat the end of the previous generation
CMA-ES TMCMC
Samples for the current generationare determined in real-time based on the
evaluation of previous chain steps
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Producer
Load Imbalance
Node 0
Node 0
Node 0
Node 0
Parallel Sampler - Single-Core Model
Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite
Lost Performance = ProducerCoresTotalCores
Pop Quiz Why do we need to sacrifice one worker node
Sample 6
Sample 7
Sample 5
Sample 3
Sample 4Sample 0
Sample 1
Sample 2
Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)
Generation-Based Methods
All samples for the next generation are knownat the end of the previous generation
CMA-ES TMCMC
Samples for the current generationare determined in real-time based on the
evaluation of previous chain steps
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI because of this asynchronicity.
Let's Discuss
Diagram (divide-and-conquer): Samples 0-7 distributed evenly across Node 0, Cores 0-3.
Q1: Is MPI a good model for the divide-and-conquer strategy?
Diagram (producer/consumer): Node 0, Core 0 acts as the producer; Cores 1-3 act as consumers processing Samples 0-7.
Q2: Is MPI a good model for the producer/consumer strategy?
Asynchronous communication models might be better in these cases (e.g., UPC++).
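For reference, a minimal producer/consumer sketch in plain two-sided MPI (mpi4py; the tags, sample values, and model function are made up for illustration). It also shows why one rank must be sacrificed as the producer:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
TAG_WORK, TAG_STOP = 1, 2

def model(x):                      # stand-in for the expensive sample evaluation
    return x * x

if rank == 0:                      # producer
    samples = list(range(16))      # assumes at least one sample per consumer
    results, active = [], 0
    for worker in range(1, size):  # prime every consumer with one sample
        comm.send(samples.pop(), dest=worker, tag=TAG_WORK)
        active += 1
    while active > 0:              # hand out work as consumers report back
        status = MPI.Status()
        results.append(comm.recv(source=MPI.ANY_SOURCE, status=status))
        if samples:
            comm.send(samples.pop(), dest=status.Get_source(), tag=TAG_WORK)
        else:
            comm.send(None, dest=status.Get_source(), tag=TAG_STOP)
            active -= 1
    print("Results:", sorted(results))
else:                              # consumer
    while True:
        status = MPI.Status()
        x = comm.recv(source=0, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(model(x), dest=0)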
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Lets Discuss
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is the Divide and Conquer Strategy
Good for CMA-ES What about TMCMC
Q2 Is the ProducerConsumer Strategy
Good for CMA-ES And for TMCMC
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
High Throughput Computing with Korali
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
13
Study Case Heating Plate
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Study Case Heating PlateGiven
A square metal plate with 3 sources of heat underneath it
Can we infer the (xy) locations of the 3 heat sources
We have ~10 temperature measurements at different locations
14
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Study Case Configuration
Experiment Problem Bayesian Inference
Model C++ 2D Heat Equation
Solver TMCMC
Run
Heat Source 1
Heat Source 2
Heat Source 3
X Y
Likelihood Probability Distributions
15
Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)
Objective Function Likelihood by Reference Data
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step 0 Get/install any MPI library (OpenMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
Step II Go to folder practice8 and have Korali run the MPI-based model there
Step III Fix the number of MPI ranks (to e.g. 8) and analyze execution times by running different levels of 1) sampling parallelism and 2) model parallelism
Step IV Configure Korali to store profiling information and use the profiler tool to see the evolution of the samples (Using Korali > Tools > Korali Profiler)
Running Out-of-the-box applications
import os

def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    os.system("./myApp " + str(x) + " " + str(y))   # launch the external application
    result = parseFile("ResultFile.out")            # gather its output
    sample["F(x)"] = result
Computational Model
Running Out-of-the-Box Applications
Many applications are closed-source or too complicated to interface with directly. For these cases, we can run them from inside a model and then gather the results.
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)
Korali Application
$ myKoraliApp.py
Running Application
Diagram: the model invokes myApp with parameters x and y; the application writes its Result to ResultFile.out, which the model reads back via parseFile("ResultFile.out").
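A minimal sketch of such a wrapper model (the application name, argument order, and output format are assumptions for illustration; parseFile is a hypothetical helper corresponding to the one named on the slide):

import subprocess

def parseFile(path):
    # assume the external application writes a single number to its output file
    with open(path, "r") as f:
        return float(f.read().strip())

def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    subprocess.run(["./myApp", str(x), str(y)], check=True)  # run the external application
    sample["F(x)"] = parseFile("ResultFile.out")             # gather its output

Using subprocess.run rather than a raw shell call makes failures visible: check=True raises an error if the application returns a non-zero exit code.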
Practice 9 Running Out-of-the-Box Applications
Step I Go to folder practice9 and examine the model application (what are its inputs and outputs?)
Step II Modify the Korali application's objective model to run the application, specifying its inputs and gathering its output
Step III Run the application with different levels of concurrency
Running Multiple Experiments
Scheduling Multiple Experiments
Diagram: multiple experiments started together submit their samples to the same pool of workers (worker states: Idle, Busy, Done).
Effect of Simultaneous Execution
- Running experiments sequentially: average efficiency 73.9%
- Running experiments simultaneously: average efficiency 97.8%
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
Step II Run the application in parallel and use the profiler tool to see how the experiments executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Diagram: a timeline (in hours) of two Slurm jobs of 4000 nodes each. In Slurm Job 1, the Korali Engine advances Experiments 0 and 1 through Generations 0-3 until a fatal failure occurs; Slurm Job 2 resumes both experiments from the last saved generation and completes Generation 4 and the final one.
Korali can resume any Solver / Problem / Conduit combination.
Practice 11 Resuming Previous Experiments
Step I Go to folder practice11 and examine the Korali Application
Step II Run the application to completion (10 generations), taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 5 generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with those of an uninterrupted run
MPI and Sample Distribution: A Discussion
A Review of MPI
MPI: the de facto communication standard for high-performance scientific applications.
Two-sided Communication: a sender and a receiver process explicitly participate in the exchange of a message.
Diagram: MPI_Send() on one rank passes the Message through an intermediate buffer to MPI_Recv() on the other rank.
A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics: the receiver needs to know what to do with the data.
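As a minimal illustration (an mpi4py sketch, assuming mpi4py is available), both ranks must reach their matching calls before the exchange completes, and the receiver alone decides what the payload means:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    payload = {"sample": 7, "result": 3.14}
    comm.send(payload, dest=1, tag=0)    # data + synchronization: completes only with a matching recv
elif rank == 1:
    msg = comm.recv(source=0, tag=0)     # blocks until rank 0's send arrives
    # the semantics are up to the receiver: it must know this is a finished sample, not new work
    print("received", msg)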
One-Sided Communication
One-sided Communication: a process can directly access a shared partition in another process's address space via MPI_Put() / MPI_Get().
- Allows passing/receiving data without a corresponding send/recv request
- The other end is not notified of the operation (concurrency hazards)
- Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information: data.
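A minimal mpi4py sketch of one-sided access (assuming mpi4py and NumPy are available): rank 1 deposits data directly into a memory window exposed by rank 0, which never posts a receive; the fences provide the only synchronization.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(4, dtype='i')              # memory exposed for remote access
win = MPI.Win.Create(buf, comm=comm)      # collective window creation

win.Fence()                               # open an access epoch
if rank == 1:
    data = np.arange(4, dtype='i')
    win.Put(data, 0)                      # write directly into rank 0's buffer
win.Fence()                               # close the epoch; data is now visible on rank 0

if rank == 0:
    print("rank 0 buffer:", buf)
win.Free()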
A Good Case for MPI: Iterative Solvers
Diagram: traditional decomposition with 1 process (rank) per core (one node, Cores 0-3). A structured-grid stencil solver on a 2D grid iteratively approaches a solution; ranks exchange halo (boundary) cells, giving regular communication.
Most HPC applications are programmed under the Bulk-Synchronous Model: they iterate between separate computation and communication phases.
Diagram: core-usage timeline for the conventional decomposition (1 rank per core), alternating useful computation with network communication cost and intra-node data motion cost.
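A sketch of one bulk-synchronous iteration of such a stencil solver (mpi4py and NumPy assumed; a 1D row decomposition is used for brevity): each rank first exchanges its boundary rows with its neighbors (communication phase), then updates its local block (computation phase).

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 16                                    # local interior rows; rows 0 and N+1 are ghost rows
u = np.zeros((N + 2, N))
up   = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Communication phase: send edge rows to neighbors, receive their edges into ghost rows
comm.Sendrecv(u[1], dest=up,   recvbuf=u[N + 1], source=down)
comm.Sendrecv(u[N], dest=down, recvbuf=u[0],     source=up)

# Computation phase: Jacobi-style update of the interior points
u[1:N + 1, 1:-1] = 0.25 * (u[0:N, 1:-1] + u[2:N + 2, 1:-1]
                           + u[1:N + 1, :-2] + u[1:N + 1, 2:])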
A NOT so Good Case for MPI: Genome Assembly
Diagram: the original DNA is re-assembled from a pool of overlapping short fragments.
- Construct a genome (chromosome) from a pool of short fragments produced by sequencers
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces
- Complications: shreds of other books may be intermixed; pieces can also contain errors
- Chop the reads into fixed-length fragments (k-mers)
- K-mers form a De Bruijn graph; traverse the graph to construct longer sequences
- The graph is stored in a distributed hash table
Image / Slide Credit: Scott B. Baden (Berkeley Lab)
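To make the k-mer step concrete, here is a tiny self-contained Python sketch (illustrative only, not the distributed implementation): it chops a read into fixed-length k-mers and records the De Bruijn-style edge from each k-mer to the next overlapping one (a real graph would allow multiple successors per k-mer).

def kmer_edges(read, k=4):
    edges = {}
    for i in range(len(read) - k):
        edges[read[i:i + k]] = read[i + 1:i + 1 + k]   # edge between consecutive, overlapping k-mers
    return edges

print(kmer_edges("ACTCGATGCTCAATG"))   # e.g. ACTC -> CTCG, CTCG -> TCGA, ...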
A NOT so Good Case for MPI: Genome Assembly
Initial segment of DNA: ACTCGATGCTCAATG
Build k-mer graphs from independent segments, sharing their hash numbers.
Diagram: Rank 0 and Rank 1 each hold part of a distributed hash table of k-mer chains (e.g. ACTC->CTCG->TCGA, GATG->ATGC, TGCT->GCTC, TCAA->CAAT->AATG). When a coinciding hash is detected, a new edge is added, the hash tables are updated, and the chains are aligned into longer sequences.
Completely Asynchronous
- Detection of coincident hashes
- Asynchronous hash updates
Irregular Communication
- K-mer chain size can vary
- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to this asynchronicity.
Let's Discuss
Diagram: divide-and-conquer distribution of Samples 0-7 across Node 0, Cores 0-3.
Q1: Is MPI a good model for the divide-and-conquer strategy?
Diagram: producer/consumer distribution of Samples 0-7, with Node 0 Core 0 acting as producer and Cores 1-3 as consumers.
Q2: Is MPI a good model for the Producer/Consumer strategy?
Asynchronous communication models might be better in these cases (e.g. UPC++).
Practice 6 Running Study Case
Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions
16
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
17
Parallel Execution
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Heterogeneous Model Support
+ Sequential (default) Good for simple function-based PythonC++ models
+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)
+ Distributed For MPIUPC++ distributed models (eg Mirheo)
Korali exposes multiple ldquoConduitsrdquo ways to run computational models
18
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Sequential ConduitLinks to the model code and runs the model sequentially via function call
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)
Korali Application
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result
Computational Model
$ myKoraliApppy
Running Application
19
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel
e = koraliExperiment()
k = koraliEngine()
k[Conduitrdquo][Type] = Concurrent
k[Conduitrdquo][Concurrent Jobs] = 4
krun(e)
Korali Application
$ myKoraliApppy
Running Application
Korali Main Process
Worker 0
Worker 1
Worker 2
Worker 3
Fork
Join
Sample Sample SampleSample
SampleSampleSample
Sample
20
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Practice 7 Parallelize Study Case
Step I Go to folder practice7 and
Use the concurrent conduit to parallelize the code in practice 6
21
Step II Analyze running times by running different levels of parallelism
Step III Use the top command to observe the CPU usage while you run the example
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution: A Discussion

A Review of MPI
MPI: the de facto communication standard for high-performance scientific applications.

Two-Sided Communication
A sender and a receiver process explicitly participate in the exchange of a message:
MPI_Send() -> [Intermediate Buffer] -> MPI_Recv()
A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics: the receiver needs to know what to do with the data.

One-Sided Communication
A process can directly access a shared partition in another process's address space via MPI_Put() / MPI_Get().
- Allows passing/receiving data without a corresponding send/recv request
- The other end is not notified of the operation (concurrency hazards)
- Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information: data.
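As a small illustration of the two modes, here is a sketch using mpi4py rather than the C API named above (run with at least two ranks, e.g. mpirun -n 2 python3 modes.py; buffer sizes are arbitrary):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Two-sided: both ranks take part explicitly (data + synchronization)
data = np.arange(4, dtype='d')
if rank == 0:
    comm.Send(data, dest=1, tag=0)        # MPI_Send
elif rank == 1:
    buf = np.empty(4, dtype='d')
    comm.Recv(buf, source=0, tag=0)       # MPI_Recv

# One-sided: rank 0 writes directly into rank 1's exposed window;
# rank 1 issues no receive for this particular access
window_memory = np.zeros(4, dtype='d')
win = MPI.Win.Create(window_memory, comm=comm)
win.Fence()                               # open an access epoch
if rank == 0:
    win.Put(data, 1)                      # MPI_Put into rank 1's window
win.Fence()                               # close the epoch
win.Free()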
A Good Case for MPI: Iterative Solvers

Structured Grid Stencil Solver
- Traditional decomposition: 1 process (rank) per core; the 2D grid is split across the cores of each node (Core 0-3).
- Iteratively approaches a solution.
- Ranks exchange halo (boundary) cells with their neighbors.
- Regular communication.

Core Usage Timeline (Conventional Decomposition, 1 Rank / Core)
Most HPC applications are programmed under the bulk-synchronous model: they alternate between separate computation and communication phases.
(Diagram: each rank's timeline alternates between useful computation during the computation phase and network communication plus intra-node data motion cost during the communication phase.)
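A minimal bulk-synchronous halo-exchange sketch with mpi4py (a 1D strip per rank for brevity; the stencil, strip size and iteration count are illustrative assumptions):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 100                       # interior cells owned by this rank
u = np.zeros(n + 2)           # one ghost (halo) cell on each side
u[1:n + 1] = rank             # arbitrary initial data

left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for it in range(10):
    # Communication phase: exchange halo (boundary) cells with neighbors
    comm.Sendrecv(u[1:2], dest=left, recvbuf=u[n + 1:n + 2], source=right)
    comm.Sendrecv(u[n:n + 1], dest=right, recvbuf=u[0:1], source=left)
    # Computation phase: 3-point stencil update of the interior cells
    u[1:n + 1] = 0.5 * (u[0:n] + u[2:n + 2])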
A NOT so Good Case for MPI: Genome Assembly
- Construct a genome (chromosome) from a pool of short fragments produced by sequencers.
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces.
- Complications: shreds of other books may be intermixed and can also contain errors.
- Chop the reads into fixed-length fragments (k-mers).
- K-mers form a De Bruijn graph; traverse the graph to construct longer sequences.
- The graph is stored in a distributed hash table.
(Diagram: an original DNA strand is sequenced into fragments and re-assembled.)
Image/Slide Credit: Scott B. Baden (Berkeley Lab)

A NOT so Good Case for MPI: Genome Assembly
Initial segment of DNA: ACTCGATGCTCAATG
(Diagram: Rank 0 and Rank 1 build k-mer graphs from independent segments and keep the edges in local hash tables, e.g. GATG->ATGC, ACTC->CTCG->TCGA, TGCT->GCTC, TCAA->CAAT->AATG. When a coinciding hash is detected between the ranks, a new edge is detected and the remote hash table is updated, aligning the k-mers into a longer chain such as TGCT->GCTC->CTCA->TCAA->CAAT->AATG.)
Completely asynchronous:
- Detection of coincident hashes
- Asynchronous hash updates
Irregular communication:
- K-mer chain size can vary
- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement in MPI due to its asynchronicity.
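As a small, single-process sketch of the k-mer idea above (k = 4, matching the edges on the slide; an ordinary dict stands in for the distributed hash table):

def kmer_edges(read, k=4):
    # slide a window over the read and yield De Bruijn edges (k-mer -> next k-mer)
    for i in range(len(read) - k):
        yield read[i:i + k], read[i + 1:i + 1 + k]

table = {}
for src, dst in kmer_edges("ACTCGATGCTCAATG"):
    table[src] = dst          # e.g. ACTC -> CTCG, CTCG -> TCGA, ..., CAAT -> AATG

# walking the table from ACTC reconstructs the original segment
print(table)

In the distributed setting, each rank owns part of this table and updates entries owned by other ranks as it discovers edges, which is where the asynchronous, irregular communication comes from.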
Let's Discuss
(Diagram: divide-and-conquer distribution, Samples 0-7 split evenly across Node 0, Cores 0-3.)
Q1: Is MPI a good model for the divide-and-conquer strategy?
(Diagram: producer/consumer distribution, with Node 0 Core 0 as producer and Cores 1-3 as consumers handling Samples 0-7.)
Q2: Is MPI a good model for the Producer/Consumer strategy?
Asynchronous communication models might be better in these cases (e.g. UPC++).
22
Distributed Execution
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation
sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
23
Korali Engine Rank
Node 1
Node 2
Node 3
Node 4
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams
e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4
krun(e)
Korali Application
def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result
Computational Model
$ mpirun -n 17 myKoraliApppy
Running Application
Rank 0
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Rank 7
Rank 8
Rank 9
Rank 10
Rank 11
Rank 12
Rank 13
Rank 14
Rank 15
Subcomm 0
Subcomm 1
Subcomm 2
Subcomm 3
24
Korali Engine Rank
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Idle
Idle
Idle
Idle
Koralirsquos Scalable Sampler
Start ExperimentSamples
Busy
Busy
Busy
Busy
Done
Done
Done
Done
Save Results Check For Termination
Run Next Generation
Idle
Idle
Idle
Idle25
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Practice 8 MPI-Based Distributed Models
Step II Go to folder practice8 and have Korali run the the MPI-based model there
26
Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of
1) Sampling parallelism2) Model Parallelism
Step IV Configure Korali to store Profiling Information and use the profiler tool to see the
evolution of the samplesUsing Korali gt Tools gt Korali Profiler
Step 0 Getinstall any MPI library (openMPI is open-source)
Step I Use the distributed conduit to parallelize practice7
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
27
Running Out-of-the-box applications
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result
Computational Model
For these cases we can run them from inside a model and then gather the results
Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others
e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)
Korali Application
$ myKoraliApppy
Running Application
28
myAppmyApp x y Result
ResultFileout
parseFile(ResultFileout)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
Practice 9 Running out-of-the-box applicationsStep I
Go to folder practice9 and examine the model application (what are its inputs and outputs)
29
Step II Modify the Korali applications objective model to run the application specifying its
inputs and gathering its output
Step III Run the application with different levels of concurrency
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments
Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation
Gen 1
Gen 1Gen 0
Gen 0 Gen 2
Gen 2
Gen 3
Gen 3
Time (Hours)
Slurm Job 1 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Fatal Failure
Gen 4
Gen 4
Final
Final
Slurm Job 2 (4000 Nodes)
Experiment 0
Experiment 1
Korali Engine
Korali can resume any Solver Problem Conduit combination35
Practice 11 Running Multiple Experiments
Step I Go to folder practice11 and examine the Korali Application
36
Step II Run the application to completion (10 generations) taking note of the final result
Step III Delete the results folder and change the Korali application to run only the first 55
generations (with this we simulate that an error has occurred)
Step IV Now change the application again to run the last 5 generations
Step V Compare the results with that of an uninterrupted run
MPI and Sample Distribution A Discussion
Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message
MessageMPI_Recv()MPI_Send()
Intermediate Buffer
A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics the receiver needs to know what to do with the data
MPI De facto communication standard for high-performance scientific applications
A Review of MPI
One-sided Communication A process can directly access a shared partition in another address space
MPI_Put()MPI_Get()
One-Sided Communication
Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information data
A Good Case for MPI Iterative Solvers
Traditional Decomposition
1 Process (Rank) per Core
Node
Core 0 Core 1
Core 2 Core 3
Iteratively approaches a solution
Ranks Exchange Halo (Boundary) Cells
Structured Grid Stencil Solver
2D Grid
Regular Communication
TimeCore Usage Timeline
Conventional Decomposition (1 Rank Core)
R0
Network
R0
Network
Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases
R0
Useful Computation
Network Communication Cost
Intra-Node Data Motion Cost
Computation Phase
Network
Communication Phase
A NOT so Good Case for MPI Genome Assembly
Original DNA
Re-assembled DNA
Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table
Image Credit Slide Credit Scott B Baden (Berkeley Lab)
A NOT so Good Case for MPI Genome Assembly
Initial Segment of DNA ACTCGATGCTCAATG
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA
Hash Table for Rank 1
TGCT-gtGCTC TCAA-gtCAAT-gtAATG
Hash Table for Rank 0
Rank 0 Rank 1
Detect new edgeUpdate Hash Table
Detect coinciding hash
Build k-mer graphs from independent segments sharing their hash numbers
GATG-gtATGC ACTC-gtCTCG-gtTCGA
TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG
Hash Table for Rank 1
TGCT-gtGCTC
Hash Table for Rank 0
Align K-mers
Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates
Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement on MPI due to its asynchronicity
Lets Discuss
Sample 7
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Node 0 Core 0
Node 0 Core 1
Node 0 Core 2
Node 0 Core 3
Sample 0
Q1 Is MPI a good model for the
divide-and-conquer strategy
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Node 0 Core 0
Producer
Node 0 Core 1
Consumer
Node 0 Core 2
Consumer
Node 0 Core 3
Consumer
Sample 0
Q2 Is MPI a good model for the
ProducerConsumer strategy
Asynchronous communication models might be better in these cases (eg UPC++)
30
Running Multiple Experiments
Scheduling Multiple Experiments
Samples
SamplesIdle
Done
Busy
Busy
Start Experiments
31
Effect of Simultaneous ExecutionRunning Experiments Sequentially
Average Efficiency 739
Running Experiments Simultaneously
Average Efficiency 978
32
Practice 10 Running Multiple Experiments
Step I Go to folder practice10 and examine the Korali Application
33
Step II Run the application in parallel and use the profiler tool too see how the experiments
executed
Step III Change the Korali application to run all experiments simultaneously
Step IV Run and profile the application again and compare the results with those of Step II
34
Resuming Previous Experiments

Self-Enforced Fault Tolerance
Korali saves the entire state of the experiment(s) at every generation.
[Diagram: time in hours. Slurm Job 1 (4000 nodes) runs Experiments 0 and 1 through the Korali Engine for Generations 0-3 until a fatal failure; Slurm Job 2 (4000 nodes) resumes both experiments from the last saved generation and runs Generation 4 through the final generation.]
Korali can resume any Solver / Problem / Conduit combination.
Practice 11: Resuming Previous Experiments
Step I: Go to folder practice11 and examine the Korali application.
Step II: Run the application to completion (10 generations), taking note of the final result.
Step III: Delete the results folder and change the Korali application to run only the first 5 generations (with this we simulate that an error has occurred).
Step IV: Now change the application again to run the last 5 generations (see the sketch below).
Step V: Compare the results with those of an uninterrupted run.
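A minimal sketch of the resume step, assuming the loadState() call shown in the Korali tutorials; the result path, objective function, and settings are illustrative placeholders, not the actual practice11 setup.

```python
import korali

# Placeholder objective for illustration only.
def objective(s):
    x = s["Parameters"][0]
    s["F(x)"] = -x * x

e = korali.Experiment()
# If a previous run saved its state, continue from the last stored generation;
# otherwise the experiment starts from generation 0.
e.loadState("_korali_result/latest")

e["Problem"]["Type"] = "Optimization"
e["Problem"]["Objective Function"] = objective
e["Variables"][0]["Name"] = "X"
e["Variables"][0]["Lower Bound"] = -10.0
e["Variables"][0]["Upper Bound"] = +10.0
e["Solver"]["Type"] = "Optimizer/CMAES"
e["Solver"]["Population Size"] = 8
# Set to 5 for the interrupted first run, then to 10 for the resumed run.
e["Solver"]["Termination Criteria"]["Max Generations"] = 10
e["File Output"]["Path"] = "_korali_result"

k = korali.Engine()
k.run(e)
```

Because the saved state contains the full solver state, the resumed run should reproduce the same final result as an uninterrupted 10-generation run.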
MPI and Sample Distribution: A Discussion

A Review of MPI
MPI: the de facto communication standard for high-performance scientific applications.
Two-sided communication: a sender and a receiver process explicitly participate in the exchange of a message.
MPI_Send() -> [Intermediate Buffer] -> MPI_Recv()
A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)
It does not encode semantics: the receiver needs to know what to do with the data.
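A minimal two-sided sketch using mpi4py (run with at least two ranks); the sample payload is a made-up example, not part of the lecture code.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    sample = {"id": 0, "parameters": [0.5, 1.2]}
    comm.send(sample, dest=1, tag=0)        # the sender explicitly participates...
elif rank == 1:
    sample = comm.recv(source=0, tag=0)     # ...and so does the receiver
    print("rank 1 received:", sample)
```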
One-Sided Communication
One-sided communication: a process can directly access a shared partition in another address space.
MPI_Put() / MPI_Get()
- Allows passing/receiving data without a corresponding send/recv request
- The other end is not notified of the operation (concurrency hazards)
- Good for cases in which synchronization ordering is not necessary
It only encodes one piece of information: data.
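A minimal one-sided sketch using mpi4py RMA windows (run with at least two ranks); the buffer contents are illustrative only.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank exposes a small buffer as an RMA window that other ranks
# may read or write directly.
buf = np.zeros(4, dtype='d')
win = MPI.Win.Create(buf, comm=comm)

win.Fence()                              # open an access epoch on all ranks
if rank == 0 and comm.Get_size() > 1:
    payload = np.arange(4, dtype='d')
    win.Put(payload, 1)                  # write into rank 1's window; rank 1 issues no matching call
win.Fence()                              # close the epoch: the data is now visible at the target

if rank == 1:
    print("rank 1 buffer after the remote put:", buf)
win.Free()
```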
A Good Case for MPI: Iterative Solvers
Traditional decomposition: 1 process (rank) per core.
[Diagram: one node with Cores 0-3, each owning a block of the grid]
Structured-grid stencil solver on a 2D grid:
- Iteratively approaches a solution
- Ranks exchange halo (boundary) cells
- Regular communication
Core usage timeline, conventional decomposition (1 rank per core):
Most HPC applications are programmed under the Bulk-Synchronous Model: they alternate between separate computation and communication phases.
[Timeline legend: useful computation, network communication cost, intra-node data motion cost; each iteration consists of a computation phase followed by a communication phase.]
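A minimal halo-exchange sketch in mpi4py for a 1D strip decomposition of a 2D grid; the grid size, number of iterations, and Jacobi-style update are placeholder choices, not the lecture's solver.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns n interior rows plus two halo (ghost) rows shared with its neighbours.
n = 8
u = np.random.rand(n + 2, n)

up   = rank - 1 if rank > 0        else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for iteration in range(10):
    # Communication phase: regular exchange with well-known message sizes.
    comm.Sendrecv(sendbuf=u[1], dest=up,   recvbuf=u[n + 1], source=down)
    comm.Sendrecv(sendbuf=u[n], dest=down, recvbuf=u[0],     source=up)
    # Computation phase: Jacobi-style stencil update of the interior cells.
    u[1:n + 1, 1:-1] = 0.25 * (u[0:n, 1:-1] + u[2:n + 2, 1:-1] +
                               u[1:n + 1, :-2] + u[1:n + 1, 2:])
```

The two phases map directly onto the bulk-synchronous timeline above, which is why this pattern fits two-sided MPI so well.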
A NOT so Good Case for MPI: Genome Assembly
Construct a genome (chromosome) from a pool of short fragments produced by sequencers.
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces.
- Complications: shreds of other books may be intermixed; the shreds can also contain errors.
- Chop the reads into fixed-length fragments (k-mers).
- K-mers form a De Bruijn graph; traverse the graph to construct longer sequences.
- The graph is stored in a distributed hash table.
[Figure: original DNA vs. re-assembled DNA]
Image/slide credit: Scott B. Baden (Berkeley Lab)
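A toy sketch of the local step only: chop a read into k-mers, link consecutive k-mers as graph edges, and show which rank would own each entry if the table were distributed by hashing the k-mer. A plain dict stands in for the distributed hash table, and Python's built-in hash() (which is randomized per process) stands in for a deterministic partitioning hash.

```python
def build_kmer_graph(read, k=4, nranks=2):
    # Consecutive overlapping k-mers of the read.
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    graph = {}
    for left, right in zip(kmers, kmers[1:]):        # consecutive k-mers form edges
        graph.setdefault(left, set()).add(right)
    # Which rank would store each k-mer's hash-table entry.
    owner = {kmer: hash(kmer) % nranks for kmer in kmers}
    return graph, owner

graph, owner = build_kmer_graph("ACTCGATGCTCAATG")
print(graph)   # e.g. ACTC -> {CTCG}, CTCG -> {TCGA}, ..., CAAT -> {AATG}
print(owner)   # rank that would own each k-mer's entry
```

The hard part, which this sketch deliberately omits, is updating the distributed table asynchronously as new edges arrive from other ranks; that is what the next slide illustrates.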
A NOT so Good Case for MPI: Genome Assembly
Initial segment of DNA: ACTCGATGCTCAATG
Build k-mer graphs from independent segments, sharing their hash numbers (Rank 0 / Rank 1):
Before aligning the k-mers:
- Hash table for Rank 0: TGCT->GCTC, TCAA->CAAT->AATG
- Hash table for Rank 1: GATG->ATGC, ACTC->CTCG->TCGA, TGCT->GCTC->CTCA->TCAA
One rank detects a coinciding hash; the other detects the new edge and updates its hash table. After aligning the k-mers:
- Hash table for Rank 0: TGCT->GCTC
- Hash table for Rank 1: GATG->ATGC, ACTC->CTCG->TCGA, TGCT->GCTC->CTCA->TCAA->CAAT->AATG

Completely asynchronous:
- Detection of coincident hashes
- Asynchronous hash updates
Irregular communication:
- K-mer chain size can vary
- Need to allocate hash entries in real time (cannot pre-allocate)
Difficult to implement in MPI due to this asynchronicity.
Let's Discuss

[Diagram: divide-and-conquer: Samples 0-7 split evenly across Node 0, Cores 0-3]
Q1: Is MPI a good model for the divide-and-conquer strategy?

[Diagram: producer/consumer: Node 0 Core 0 acts as producer; Cores 1-3 act as consumers evaluating Samples 0-7]
Q2: Is MPI a good model for the producer/consumer strategy?

Asynchronous communication models might be better in these cases (e.g., UPC++).
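For reference in the Q2 discussion, here is a minimal producer/consumer sketch in mpi4py using two-sided messages and MPI.ANY_SOURCE (run with at least two ranks); the sample evaluation is a dummy, and the sample count is a placeholder.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
NUM_SAMPLES = 8

if rank == 0:
    # Producer: hand out samples opportunistically as results come back.
    next_sample, pending, results = 0, 0, {}
    for worker in range(1, size):                        # prime every worker
        if next_sample < NUM_SAMPLES:
            comm.send(next_sample, dest=worker)
            next_sample += 1
            pending += 1
        else:
            comm.send(None, dest=worker)                 # nothing left to do
    status = MPI.Status()
    while pending > 0:
        sample_id, value = comm.recv(source=MPI.ANY_SOURCE, status=status)
        results[sample_id] = value
        pending -= 1
        worker = status.Get_source()                     # the worker that just freed up
        if next_sample < NUM_SAMPLES:
            comm.send(next_sample, dest=worker)
            next_sample += 1
            pending += 1
        else:
            comm.send(None, dest=worker)                 # release the worker
    print("results:", results)
else:
    # Consumer: evaluate samples until the producer sends a stop signal.
    while True:
        sample_id = comm.recv(source=0)
        if sample_id is None:
            break
        comm.send((sample_id, sample_id ** 2), dest=0)   # dummy evaluation
```

The pattern works, but the producer must poll for messages and every transfer still requires a matching send/recv pair, which is part of why asynchronous, one-sided models can be a more natural fit here.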