HPC Tutorial
Manoj Nambiar, Performance Engineering Innovation Labs, Parallelization and Optimization CoE
Computer Measurement Group, India
www.cmgindia.org
A Common Expectation
Our ERP application has slowed down. All the departments are complaining.
Let’s use HPC
Agenda
• Part I
  – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
  – OpenFOAM - CFD application (if time permits)
Designing an Airplane for Performance…
Problem: Calculate Total Lift and Drag on the plane for a wind-speed of 150 m/s
Performance Assurance – Airplanes vs Software
| Assurance Approach | Airplane            | Software                        |
|--------------------|---------------------|---------------------------------|
| Testing            | Wind tunnel testing | Load testing with virtual users |
| Simulation         | CFD simulation      | Discrete event simulation       |
| Analytical         | None                | MVA, BCMP, M/M/k etc.           |

(Accuracy and cost are highest for testing and decrease toward analytical modeling.)
CFD Example – Problem Decomposition
Methodology
1. Partition the volume into cells
2. For a number of time steps:
   2a. For each cell:
       2a.1 Calculate velocities
       2a.2 Calculate pressure
       2a.3 Calculate turbulence

All cells have to be in equilibrium with each other, so this becomes a large Ax = b problem. The problem is partitioned into groups of cells which are assigned to CPUs. Each CPU can compute in parallel, but they also have to communicate with each other.
A serial algorithm for Ax = b
Compute complexity – O(n²)
What kind of H/W and S/W do we need?
• Take the example Ax = b solver
  – The order of computational complexity is n²
  – Where n is the number of cells into which the domain is divided
• The higher the number of cells, the higher the accuracy
• Typical number of cells: in the tens of millions
• Prohibitively slow to run sequentially
• The increase in memory requirements will need a proportionally higher number of servers

A parallel implementation is needed on a large cluster of servers
Software
• Let's look at the software aspect first
  – Then we will look at the hardware
Work Load Balancing
• After solving Ax = b
  – Some elements of x need to be exchanged with neighbor groups
  – Every group (process) has to send and receive values with its neighbors
    • For the next Gauss-Seidel iteration
• Also need to check that all values of x have converged

Should this be done using TCP/IP, or a 3-tier web/app/database architecture?
Why TCP/IP won't suffice
• Philosophically – no
  – These parallel programs are peers
  – No one process is a client or a server
• Technically – no
  – There can be as many as 10,000 parallel processes
    • Need to keep a directory of public server IP and port for each process
  – TCP is a stream-oriented protocol
    • Applications need to pass messages
  – Changing the size of the cluster is tedious
Why a 3-tier architecture will not suffice
• 3-tier applications are meant to serve end-user transactions
  – This application is not transactional
• A database is not needed for these applications
  – No need to first persist and then read data
    • This kind of I/O would impact performance significantly
    • Better to store data in RAM
  – ACID properties of the database are not required
    • Applications are not transactional in nature
  – SQL is a major overhead considering the data velocity requirements
• Managed frameworks like J2EE and .NET are not optimal for such requirements
MPI to the rescue
• A message-oriented interface
• Has an API spanning 300 functions
  – Supports complex messaging requirements
• A very simple interface for parallel programming
• Also portable regardless of the size of the deployment cluster
MPI Functions
• MPI_Send
• MPI_Recv
• MPI_Wait
• MPI_Reduce
  – SUM
  – MIN
  – MAX
  – ….
Not so intuitive MPI calls
• MPI_Allgather(v)
• MPI_Scatter(v)
• MPI_Gather(v)
• MPI_Alltoall(v)

(The "(v)" denotes the variable-count variants, e.g. MPI_Allgatherv.)
Sample MPI program – parallel addition of a large array
MPI – Send, Recv and Wait
If you have some computation that can be done while waiting to receive a message from a peer, this is the place to do it.
Hardware
• Let's look at the hardware
  – Clusters
  – Servers
  – Coprocessors
  – Parallel file system
HPC Cluster
Not very different from regular data center clusters
Now let's look inside a server
Coprocessors go here
NUMA
Parallelism in Hardware
• Multi-server/Multi-node
• Multi-sockets
• Multi-core
• Co-processors– Many Core– GPU
• Vector Processing
Multi-socket server board
Multi-core CPU
Coprocessor - GPU
• SM – Streaming Multiprocessor
• Device RAM – high-speed GDDR5 RAM
• Extreme multi-threading – thousands of threads
PCIE Card
Inside a GPU streaming multiprocessor (SM)
• An SM can be compared to a CPU core
• A GPU core is essentially an ALU
• All cores execute the same instruction at a time
  – What happens to "if-then-else"?
• A warp is the software equivalent of a CPU thread
  – Scheduled independently
  – A warp instruction is executed by all cores at a time
• Many warps can be scheduled on an SM
  – Just like many threads on a CPU
  – While one warp is scheduled to run, other warps are moving data
• A collection of warps concurrently running on an SM makes a block
  – Conversely, an SM can run only one block at a time

Efficiency is achieved when there is one warp in each stage of the execution pipeline
How S/W runs on the GPU
1. A CPU process/thread initiates a data transfer from CPU memory to GPU memory
2. The CPU invokes a function (kernel) that runs on the GPU
   – The CPU specifies the number of blocks and threads per block
   – Each block is scheduled on one SM
   – After all blocks complete execution, the CPU is woken up
3. The CPU fetches the kernel output from GPU memory

This is known as the offload mode of execution
Co-Processor – Many Integrated Core (MIC)
• Cores are similar to early Intel Pentium CPUs
  – With vector processing instructions added
• The L2 cache is accessible by all the cores

Execution modes:
• Native
• Offload
• Symmetric
What is vector processing?
An ALU in an ordinary CPU core performs 1 arithmetic operation per instruction cycle:

    ADD C, A, B

An ALU in a CPU core with vector processing operates on vector registers (A1…A8, B1…B8 → C1…C8) and performs 8 arithmetic operations per instruction cycle:

    VADD C, A, B

which is equivalent to the scalar loop:

    for (i = 0; i < 8; i++) c[i] = a[i] + b[i];
HPC Networks – Bandwidth and Latency
Hierarchical network
• The most intuitive design of a network– Not uncommon in data centers
• What happens when the 1st 8 nodes need to communicate to the next 8?– Remember that all links have the same bandwidth
Top of rack
End of row switch
Clos Network
• Can be likened to a replicated hierarchical network
  – All nodes can talk to all other nodes
  – Dynamic routing capability is essential in the switches
Common HPC Network Technology – InfiniBand
• A technology for building high-throughput, low-latency networks
  – Competes with Ethernet
• To use InfiniBand you need
  – A separate NIC on the server
  – An InfiniBand switch
  – InfiniBand cables
• Messaging supported by InfiniBand
  – A direct memory access read from, or write to, a remote node (RDMA)
  – A channel send or receive
  – A transaction-based operation (that can be reversed)
  – A multicast transmission
  – An atomic operation
Parallel File Systems - Lustre
• Parallel file systems give the same file system interface to legacy applications
• Can be built out of commodity hardware and storage
HPC Applications – Modeling and Simulation
• Aerodynamics
  – Vehicular design
• Energy and Resources
  – Seismic Analysis
  – Geo-Physics
  – Mining
• Molecular Dynamics
  – Drug Discovery
  – Structural Biology
• Weather Forecasting

Simulation or physical experimentation? On the way from prototype through lab verification to final design, the HPC-or-no-HPC decision trades off accuracy, speed, power, and cost – from natural science to software.
Relatively Newer & Upcoming Applications
• Finance
  – Risk Computations
  – Options Pricing
  – Fraud Detection
  – Low Latency Trading
• Image Processing
  – Medical Imaging
  – Image Analysis
  – Enhancement and Restoration
• Bio-Informatics
  – Genomics
• Video Analytics
  – Face Detection
  – Surveillance
• Internet of Things
  – Smart City
  – Smart Water
  – eHealth

Knowledge of core algorithms is key
Technology Trends Impacting Performance & Availability
• Multi-core – clock speeds are no longer increasing
• Memory Evolution
  – Lower memory per core
  – Relatively low memory bandwidth
  – Deep cache & memory hierarchies
• Heterogeneous Computing
  – Coprocessors
• Vector Processing

Availability concerns:
• Temperature-fluctuation-induced slowdowns
• Memory-error-induced slowdowns
• Network communication errors
• Large clusters – increased failure probability

Algorithms need to be re-engineered to make the best use of these trends
Knowing Performance Bounds
• Amdahl's Law
  – Maximum speedup achievable: Sp = 1 / (s + (1 − s)/p)
  – Where s is the fraction of code that has to run sequentially and p is the number of processors
• Also important to take problem size into account when estimating speedups
• The compute-to-communication ratio is key
  – Typically, the higher the problem size, the higher the ratio, and the better the speedup
Quick Hardware Recap
• FLOPS bound vs bandwidth bound
• What about server clusters?
FLOPS and Bandwidth dependencies
• FLOPS – floating-point operations per second – depends on:
  – Frequency
  – Number of CPU sockets
  – Number of cores per socket
  – Number of hyper-threads per core
  – Number of vector units per core / hyper-thread
• Bandwidths (bytes/sec) depend on:
  – Level in the hierarchy – registers, L1, L2, L3, DRAM
  – Serial / parallel access
  – Whether memory is attached to the same CPU socket or another CPU
Why are we not talking about memory latencies?
Know your performance bounds
• The above information can also be obtained from product data sheets
• What do you gain by knowing performance bounds?
Other ways to gauge performance
• CPU speed
  – SPEC – integer and floating-point benchmarks
• Memory bandwidth
  – STREAM benchmark
Basic Problem
• Consider the following code:

      double a[N], b[N], c[N], d[N];
      int i;
      for (i = 0; i < N-1; i++)
          a[i] = b[i] + c[i]*d[i];

• If N = 10¹² and the code has to complete in 1 second:
  – How many Xeon E5-2670 CPU sockets would you need?
  – Is this memory-bound or CPU-bound?
General guiding principles for performance optimization
• Minimize communication requirements between parallel processes / threads
• If communication is essential then– Hide communication delays by overlapping compute and communication
• Maximize data locality– Helps caching– Good NUMA page placement
• Do not forget to use compiler optimization flags
• Implement weighted decomposition of workload– In a cluster with heterogeneous compute capabilities
Let your profiling results guide you on the next steps
Optimization Guidelines for GPU platforms
• Minimize use of "if-then-else" or any other branching
  – Branches cause divergence
• Tune the number of threads per block– Too many will exhaust caches and registers in the SM– Too few will underutilize GPU capacity
• Use device memory for constants
• Use shared memory for frequently accessed data
• Use sequential memory access instead of strided
• Coalesce memory accesses
• Use streams to overlap compute and communications
Steps in designing parallel programs
• Partitioning
• Communication
• Agglomeration
• Mapping
(Partitioning decomposes the data structure into primitive tasks.)
Agglomeration:
• Combine sender and receiver tasks – eliminates communication and increases locality
• Combine senders and receivers – reduces the number of message transmissions
(Mapping: tasks are assigned to nodes – NODE 1, NODE 2, NODE 3.)
Agenda
• Part I
  – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
  – OpenFOAM - CFD application on Xeon Phi (if time permits)
Multi-core Performance Enhancement: Case Study
Background
• Risk Management in a commodities exchange
• Risk computed post-trade
  – Clearing and settlement – T+2
• Risk details updated on screen
  – Alerting is controlled by human operators
Commodities Exchange: Online Risk Management
Online trades flow from the Trading System into the Risk Management System, which raises alerts; when collateral falls short, the client or clearing member is prevented from trading.

Inputs to the risk computation:
• Initial deposit of collateral
• Long/short positions on contracts
• Contract/commodity price changes
• Risk parameters changing during the day

Hierarchy: Clearing Member → Client1 … ClientK
Will standard architecture on commodity servers suffice?
Application server (2 CPUs) + database server (2 CPUs) – will this do for the Risk Management System?
Commodities Exchange: Online Risk Management
Computations:
• Position Monitoring, Mark to Market, P&L, Open Interest, Exposure Margins
• SPAN: Initial Margin (Scanning Risk), Inter-Commodity Spread Charge, Inter-Month Spread Charge, Short Option Margin, Net Option Value
• Collateral Management
The functionality is complex. Let's look at a simpler problem that reflects the same computational challenge, and come back later.
Workload Requirements
• Trades/Day : 10 Million
• Peak Trades/Sec : 300
• Traders : 1 Million
P&L Computation
| Time | Txn  | Stock | Quantity | Price | Total Amount |
|------|------|-------|----------|-------|--------------|
| t1   | BUY  | Cisco | 100      | 950   | 95,000       |
| t2   | BUY  | IBM   | 200      | 30    | 6,000        |
| t3   | SELL | Cisco | 40       | 975   | 39,000       |
| t4   | SELL | IBM   | 200      | 31    | 6,200        |
Trader A
With the current Cisco price at 970:

Profit(Cisco, t4) = −95,000 + 39,000 + (100 − 40) × 970 = −56,000 + 58,200 = 2,200

In general, the profit on a given stock S at time t is the sum of the transaction values up to time t, plus the net position on the stock at time t multiplied by the price of the stock at time t. Buy transactions take a negative value, sells a positive value.

This per-trade profit recomputation is the biggest culprit computationally.
P&L Computation
int profit[MAXTRADERS];                     // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];    // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];     // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] -= profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock]
                                   + netpositions[r][t.stock] * t.price;
        profit[r] += profitperstock[r][t.stock];
    end loop
end loop
• Profit has to be kept updated for every price change
  – For all traders
• Inner loop: 8 computations
  – 4 arithmetic operations (+ + * +)
  – 1 loop counter update
  – 3 assignments
• Actual computational complexity
  – About 20 times as complex as the displayed algorithm
• Number of traders: 1 million
P&L Computational Analysis
• SLA Expectation: 300 trades / sec
• Computations/trade
  – 8 computations x 1 million traders x 20 = 160 million
• Computations/sec = 160 million x 300 trades/sec
  – 48 billion computations/sec!
• Out of reach of contemporary servers at that time!
Can we deliver within an IT budget?
P&L Computational Analysis
Test Environment
• Server
  – 8 Xeon 5560 cores
  – 2.8 GHz
  – 8 GB RAM
• OS: CentOS 5.3
  – Linux kernel 2.6.18
• Programming Language: C
• Compilers: gcc and icc
Test Inputs

Number of Trades    1 Million
Number of Traders   100,000
Number of Stocks    100
Trade File Size     20 MB

Trade Distribution

Trades %   Stock %
20%        30%
20%        60%
60%        10%
P&L Computation: Baselining
                           Trades/sec   Overall Gain
Baseline Performance gcc   190
gcc –O3                    323          70%
P&L Computation: Transpose
int profit[MAXTRADERS];                    // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];   // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];    // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock] + netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop

[Figure: Trader x Stock matrix. A trade t on stock si updates one column, touching every trader row r1, r2, r3, … with a large stride between accesses.]

Very Poor Caching
Matrix Layout
[Figure: Trader x Stock matrix with rows r1, r2, r3 and columns Stock s1 … Stock si.]

Memory Layout (trader-major, row-major order):
| Trader r1: S1 S2 … Si | Trader r2: S1 S2 … Si | Trader r3: S1 S2 … Si | Trader r4: S1 S2 … Si |
Matrix Layout - Optimized
Optimized Memory Layout (stock-major):
| Stock S1: r1 r2 … rn | Stock S2: r1 r2 … rn | Stock S3: r1 r2 … rn | Stock S4: r1 r2 … rn |

[Figure: Stock x Trader matrix with rows S1, S2, S3 and columns Trader r1, Trader r2, Trader r3.]
P&L Computation: Transpose
int profit[MAXTRADERS];                    // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];   // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];    // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r] + netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop

[Figure: Stock x Trader matrix. After the transpose, a trade t on stock si updates one contiguous row across traders r1 … ri.]

Very Good Caching
P&L Computation: Transpose
                            Trades/sec   Overall Gain   Immediate Gain
Baseline Performance gcc    190
gcc –O3                     323          1.7X
Transpose of Trader/Stock   4750         25X            14.7X

Intel Compiler

                            Trades/sec   Overall Gain   Immediate Gain
icc –fast (not –O3)         6850         36X            37%
P&L Computation: Use of Partial Sums
int profit[MAXTRADERS];                    // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];   // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];    // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r] + netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop
profitperstock can be maintained cumulatively for the trader. It need not be kept per stock.
P&L Computation: Use of Partial Sums

int profit[MAXTRADERS];                    // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];   // net positions per stock
int sumtxnvalue[MAXTRADERS];               // net transaction values
int sumposvalue[MAXTRADERS];               // sum of netpositions * stock price
int ltp[MAXSTOCKS];                        // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop

sumposvalue: monetary value of all stock positions at the time of the trade
                      Trades/sec   Overall Gain   Immediate Gain
Use of Partial Sums   9650         50X            41%
P&L Computation: Skip Zero Values
int netpositions[MAXSTOCKS][MAXTRADERS];

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop
Majority of the values of this matrix are 0, thanks to hot stocks
                   Trades/sec   Overall Gain   Immediate Gain
Skip Zero Values   10800        56X            12%
• There is a large percentage of cold stocks
  – Those which are held by very few traders
• In the last optimization an "if" check was added to avoid computation
  – If the trader does not hold the traded stock
• Is there any benefit if the trader record is not accessed at all?
  – We are computing for 100,000 traders
P&L Computation: Cold Stocks
P&L Computation: Sparse Matrix Representation
Flags Table – this stock is owned by who? (updated in outer loop)

Stock   A   B   C   D   E
s1      1   1   0   0   0
s2      1   1   1   0   0
s3      1   0   0   1   1

Trader Indexes per Stock (traversed in outer loop)

Stock   Count   T0   T1   T2   .   .
s1      2       A    B    0    0   0
s2      3       A    C    B    0   0
s3      3       A    E    D    0   0
P&L Computation: Sparse Matrix Representation
int profit[MAXTRADERS];                    // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];   // net positions per stock
int sumtxnvalue[MAXTRADERS];               // net transaction values
int sumposvalue[MAXTRADERS];               // sum of netpositions * stock price
int ltp[MAXSTOCKS];                        // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop
Traverse the per-stock trader index list when the stock's trader count is less than a threshold.
                Trades/sec   Overall Gain   Immediate Gain
Sparse Matrix   36000        189X           3.24X
P&L Computation: Clustering
Before (poor caching for sparse matrix lists):

int profit[MAXTRADERS];
int sumtxnvalue[MAXTRADERS];
int sumposvalue[MAXTRADERS];

After (better caching performance!):

struct TraderRecord {
    int profit;
    int sumtxnvalue;
    int sumposvalue;
};

             Trades/sec   Overall Gain   Immediate Gain
Clustering   70000        368X           94%
P&L Computation: Precompute Price Difference
                          Trades/sec   Overall Gain   Immediate Gain
Precompute Price Diff     75000        394X           7%

int netpositions[MAXSTOCKS][MAXTRADERS];

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop

(t.price - ltp[t.stock]) is loop-invariant: move it outside the loop.
P&L Computation: Loop Unrolling
                 Trades/sec   Overall Gain   Immediate Gain
Loop Unrolling   80000        421X           7%

#pragma unroll
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop
Commodities Exchange: Online Risk Management
[Diagram: Online trades flow from the Trading System into the Risk Management System. When collateral falls short, the Risk Management System raises alerts to the Clearing Member and prevents the Client/Clearing Member from trading.]

• A Clearing Member serves Client1 … ClientK
• Initial deposit of collateral
• Long/Short positions on contracts
• Contract/commodity price changes
• Risk parameters change during the day
P&L Computation: Batching of Trades
                          Trades/sec   Overall Gain   Immediate Gain
Batching of 100 trades    150000       789X           1.88X
Batching of 1000 trades   400000       2105X          2.67X

Batch n trades and use the ltp of the last trade  // increases risk by a small delay

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop
So far all this is with only one thread!!!
P&L Computation: Use of Parallel Processing
         Trades/sec    Overall Gain   Immediate Gain
OpenMP   1.2 million   5368X          2.55X

#pragma omp parallel for schedule(static, chunk)  // 32 threads on 8-core Intel server
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop
P&L Computation: Summary of Optimizations

Optimization                Trades/sec   Immediate Gain   Overall Gain
Baseline gcc                190
gcc –O3                     320          1.70X            1.7X
Transpose of Trader/Stock   4750         14.70X           25X
Intel Compiler icc –fast    6850         1.37X            36X
Use of Partial Sums         9650         1.41X            50X
Skip Zero Values            10,800       1.12X            56X
Sparse Matrix               36,000       3.24X            189X
Clustering of Arrays        70,000       1.94X            368X
Precompute Price Diff       75,000       1.07X            394X
Loop Unrolling              80,000       1.07X            421X
Batching of 100 Trades      150,000      1.88X            789X
Batching of 1000 Trades     400,000      2.67X            2105X
OpenMP                      1,020,000    2.55X            5368X

(All rows through Batching: single thread. OpenMP row: 8 CPUs, 32 threads.)
Background
Lattice Boltzmann on GPU
2-D Square Lid Driven Cavity Problem
[Figure: a square cavity of side L filled with fluid; the top lid moves along X with constant velocity U, Y pointing up.]

Flow is generated by continuously moving the top lid at a constant velocity.
Level 1
Time (ms)   MGUPS      Remarks
520727.1    5.034192   Simply ported CPU code to GPU. Structure Node & Lattice in GPU Global Memory.

/* CPU Code */
for (y = 0; y < (ny-2); y++) {
    for (x = 0; x < (nx-2); x++) {
        --
    }
}

/* GPU Code */
/* for (int y = 0; y < (ny-2); y++) { */
if (tid < (ny-2)) {
    for (x = 0; x < (nx-2); x++) {
        --
    }
}

Replace outer loop iterations with threads. Total threads = (ny-2); each thread works on (nx-2) grid points.
MGUPS = (GridSize x TimeIterations) / (Time x 1000000)
Level 2
Time (ms)   MGUPS      Remarks
115742      22.64899   Loop Collapsing

/* GPU Code Level 1 */
if (tid < (ny-2)) {
    for (x = 0; x < (nx-2); x++) {
        --
    }
}

/* GPU Code with Loop Collapsing */
if (tid < ((ny-2)*(nx-2))) {
    y = (tid / (nx-2)) + 1;
    x = (tid % (nx-2)) + 1;
    --
}

Collapsing the 2 nested loops into one exposes massive parallelism.
Total threads = (ny-2)*(nx-2); now each thread works on 1 grid point.
About GPU Constant Memory
• Can be used for data that will not change over the course of kernel execution.
• Define constant memory using __constant__.
• cudaMemcpyToSymbol will copy data to constant memory.
• Constant memory is cached.
• Constant memory is read-only.
• Just 64 KB.

[Diagram: Tesla C2075. SM 1 … SM 14 all read the cached Constant Memory alongside Global Memory.]
Level 3
Time (ms)   MGUPS    Remarks
113061.8    23.186   Copied Lattice Structure in GPU Constant Memory

__constant__ Lattice lattice_dev_const[1];
cudaMemcpyToSymbol(lattice_dev_const, lattice, sizeof(Lattice));

typedef struct Lattice {
    int Cs[9];
    int Lattice_velocities[9][2];
    real_dt Lattice_constants[9][4];
    real_dt ek_i[9][9];
    real_dt w_k[9];
    real_dt ac_i[9];
    real_dt gamma9[9];
} Lattice;
Level 4
Time (ms)   MGUPS   Remarks
40044.5     65.5    Coalesced Memory Access pattern for Node Structure

typedef struct Node {   /* AoS, (ny*nx) elements */
    int Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
} Node;

[Figure: AoS memory layout. Grid Point 0: Type | Vel[2] | Density | F[9] | Ftmp[9], immediately followed by the same fields for Grid Point 1, and so on.]
Level 4 (Cont.)

[Figure: AoS layout. When all threads simultaneously access Density, threads T-0 and T-1 read addresses one full Node apart (a large stride). Inefficient access of global memory.]

[Figure: Rearranged layout. The same simultaneous Density accesses land on consecutive addresses, giving a coalesced access pattern. Efficient access of global memory.]
Level 4 (Cont.)
/* SoA mapping of the Node structure */
typedef struct Type    { int *val; } Type;
typedef struct Vel     { real_dt *val; } Vel;
typedef struct Density { real_dt *val; } Density;
typedef struct F       { real_dt *val; } F;
typedef struct Ftmp    { real_dt *val; } Ftmp;

typedef struct Node_map {
    Type type;
    Vel vel[2];
    Density density;
    F f[9];
    Ftmp ftmp[9];
} Node_dev;

/* Original AoS, (ny*nx) elements */
typedef struct Node {
    int Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
} Node;
Level 5
Time (ms)   MGUPS   Remarks
14492.6     180.9   Arithmetic Optimizations

for (int k = 3; k < SPEEDS; k++) {
    // mk[k] = lattice_dev_const->gamma9[k] * mk[k];
    // mk[k] = lattice_dev_const->gamma9[k] * mk[k] / lattice_dev_const->w_k[k];
    mk[k] = lattice_dev_const->gamma9_div_wk[k] * mk[k];
}

for (int i = 0; i < SPEEDS; i++) {
    f_neq = 0.0;
    for (int k = 0; k < SPEEDS; k++) {
        // f_neq += ((lattice_dev_const->ek_i[k][i] * mk[k]) / lattice_dev_const->w_k[k]);
        f_neq += lattice_dev_const->ek_i[k][i] * mk[k];
    }
}
Level 6
Time (ms)     MGUPS        Remarks
8309.662109   315.468903   Algorithmic Optimization

[Diagram: the Collision kernel reads F, v, and density and stores Ftmp to global memory; a global barrier follows; the Streaming kernel then loads Ftmp and stores F.]

• Collision stores Ftmp to GPU Global Memory.
• Streaming loads Ftmp from GPU Global Memory.
• Global Memory Load/Store operations are expensive.
Level 6 (Cont.)
[Diagram: Collision then Streaming, pull style — each node gathers Ftmp from its neighbors.]

Pulling Ftmp from neighbors needs synchronization.
Level 6 (Cont.)
[Diagram: Collision then Streaming, push style — each node scatters Ftmp to its neighbors.]

Instead, push Ftmp to the neighbors: no need for synchronization.
Level 6 (Cont.)
• Collision & Streaming can be one kernel.
• Saves one Load/Store from/to Global Memory.

[Diagram: the fused kernel reads F, v, and density, performs collision and streaming together, and writes F directly; the Ftmp round trip through global memory disappears.]
Optimizations Achieved on GPU using CUDA
Levels   Time (ms)     MGUPS        Remarks
1        520727.1      5.034192     Simply ported CPU code to GPU. Structure Node & Lattice in GPU Global Memory.
2        115742        22.64899     Loop Collapsing
3        113061.8      23.186       Copied Lattice Structure in GPU Constant Memory
4        40044.5       65.5         Coalesced Memory Access pattern for Node Structure
5        14492.6       180.9        Arithmetic Optimizations
6        8309.662109   315.468903   Algorithmic Optimization

(MGUPS = Million Grid Updates Per Second)
CUDA Card: Tesla C2075 (448 Cores, 14 SM, Fermi, Compute 2.0)
Recap
• Part I – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
Closing Comments

• OLTP applications seldom require HPC technologies
  – Unless it is an application that needs to respond in microseconds
    • Algo trading etc.
• Can HPC technologies be used to speed up my data-transformation (ETL/ELT) and reporting workloads?
  – Sure, but you have to let go of the ease of using 3rd-party products & databases
    • If you don't want to, customizing a specific bottleneck process could help
  – Stay tuned to companies innovating in this space
    • e.g. SQREAM, which implements database operations on GPUs
• Investing in an HPC cluster and technologies is not enough
  – Also invest in people who understand
    • Underlying technologies
    • Applications
www.cmgindia.org
Q&A