Post on 25-Mar-2020
CISC 879 : Software Support for Multicore Architectures
Presented By: Kanik Sem
Dept of Computer & Information Sciences
University of Delaware
Porting Financial Market Applications to the Cell Broadband Engine Architecture
John Easton, Ingo Meents, Olaf Stephen, Horst Zisgen, Sei Kato
Outline
• Why Cell B.E. for financial markets?
• Porting strategies for the Cell B.E. platform
• Performance results
• Mixed-precision workloads
• Tying it all together
• Conclusions
Why Cell B.E. for financial markets?
• Potential for dramatic impact on financial applications
• Application codes ported to the Cell
• Optimized codes to fully exploit Cell
• Performance improvements of almost 40x
A description of the application
• Code used to price a European option.
• Model based on the Monte Carlo simulation technique.
• Need to generate a large number (200,000,000 in this case) of uniform, pseudo-random numbers.
• Using the random numbers generated, execute the financial model.
Porting strategies for Cell
• Recompilation of existing code for Cell
  • XLC better than gcc
• Make some structural changes
  • Framework to start separate threads on each SPU.
  • Splitting RNG across all cores.
• Make functional changes to the code.
  • Re-engineered functions to exploit vectorization on SPU cores.
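One way to picture the structural change of splitting the random-number work across cores is an even partition of the evaluation count. The sketch below is hypothetical: the helper names and the rule of giving any remainder to the lowest-numbered workers are assumptions, not taken from the ported code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of evaluations assigned to one worker when
 * `total` evaluations are divided across `n_workers` (for example, SPUs).
 * The first (total % n_workers) workers take one extra evaluation. */
int64_t evaluations_for_worker(int64_t total, int n_workers, int worker_id)
{
    assert(total >= 0 && n_workers > 0);
    assert(worker_id >= 0 && worker_id < n_workers);

    int64_t base = total / n_workers;
    int64_t extra = total % n_workers;
    return base + (worker_id < extra ? 1 : 0);
}

/* Hypothetical helper: index of the first evaluation owned by a worker, so
 * each worker can derive a distinct starting point or seed for its stream. */
int64_t first_evaluation_for_worker(int64_t total, int n_workers, int worker_id)
{
    assert(total >= 0 && n_workers > 0);
    assert(worker_id >= 0 && worker_id < n_workers);

    int64_t base = total / n_workers;
    int64_t extra = total % n_workers;
    int64_t leading_extras = (worker_id < extra) ? worker_id : extra;
    return base * worker_id + leading_extras;
}
```

With 200,000,000 evaluations on 16 SPUs, each worker receives 12,500,000 evaluations under this scheme.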
Analysis of the original code
%time   Seconds   Calls       Function name
62.70   118.32    200000000   getRandom()
37.18    70.16            1   simulateEuropeanOptionValue()
 0.14     0.27            1   hpcMonteCarlo::random()
 0.00     0.00            2   hpcBlackScholes()
SDK for Cell provides optimized RNG.
Can generate 64 number generators at once on a Cell blade.
Use gettimeofday() function.
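The slide does not say what gettimeofday() is used for. Assuming it takes wall-clock timestamps around the simulation, a timing helper could look like the sketch below. The names wall_seconds and time_placeholder_work, and the loop being timed, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Current wall-clock time in seconds, using POSIX gettimeofday(). */
double wall_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc;   /* keeps the variable used when assertions are disabled */
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Hypothetical example: time a placeholder workload and return the
 * elapsed wall-clock seconds. */
double time_placeholder_work(long iterations)
{
    volatile double sink = 0.0;   /* volatile so the loop is not optimized away */
    double start = wall_seconds();
    for (long i = 0; i < iterations; i++) {
        sink += (double)i;
    }
    return wall_seconds() - start;
}
```

Taking one timestamp before and one after the whole run measures elapsed time rather than CPU time, which is the quantity reported in the performance tables.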
Initial performance results
To run the performance tests, the following parameters were used:
• Compiler used: spuxlc, ppuxlc
• Compiler optimization setting: -O3 -qstrict
• Random-number generation method: sdk
• Precision: single
• Number of evaluations: 200,000,000
Initial performance results
Performance by number of SPUs (single precision)
Elapsed times are in seconds: measured on a 2.4 GHz Cell/B.E. processor, estimated for a 3.2 GHz Cell/B.E. processor.
Number of SPUs   2.4 GHz (measured)   3.2 GHz (estimated)   Speedup
1                65.7                 49.27                 1
2                32.9                 24.6                  1.99
3                21.9                 16.42                 3
4                16.4                 12.3                  4
5                13.18                9.88                  4.98
6                10.9                 8.17                  6.02
7                9.4                  7.05                  6.98
8                8.2                  6.15                  8.01
9                7.3                  5.4                   9
10               6.6                  4.95                  9.95
11               6                    4.5                   10.95
12               5.5                  4.12                  11.94
13               5.1                  3.8                   12.88
14               4.7                  3.52                  13.97
15               4.4                  3.3                   14.93
16               4.1                  3.07                  16.02
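The slides do not say how the speedup and the 3.2 GHz estimates were obtained. The tabulated values are consistent with the following reading, which is an assumption: speedup is the one-SPU time divided by the n-SPU time, and the estimate scales the measured time by the clock ratio.

```latex
S(n) = \frac{T_{2.4}(1)}{T_{2.4}(n)}, \qquad
T_{3.2}(n) \approx T_{2.4}(n) \cdot \frac{2.4}{3.2} = 0.75 \, T_{2.4}(n)

% Worked example for n = 16 (single precision):
S(16) = \frac{65.7}{4.1} \approx 16.02, \qquad
T_{3.2}(16) \approx 0.75 \times 4.1 = 3.075 \quad \text{(listed as 3.07)}
```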
Double Precision
Organizations in financial markets require double-precision calculations.
Initial target marketplace for Cell does not need this.
Initial implementation of Cell provides limited double-precision support in hardware:
Single precision: fully pipelined
Double precision: partially pipelined
Performance results
Performance by number of SPUs (double precision)
Elapsed times are in seconds: measured on a 2.4 GHz Cell/B.E. processor, estimated for a 3.2 GHz Cell/B.E. processor.
Number of SPUs   2.4 GHz (measured)   3.2 GHz (estimated)   Speedup
1                157.3                117.9                 1
2                78.6                 58.9                  2
3                52.4                 39.3                  3
4                39.3                 29.47                 4
5                31.49                23.61                 4.99
6                26.25                19.68                 5.99
7                22.5                 16.8                  6.99
8                19.7                 14.7                  7.98
9                17.5                 13.12                 8.98
10               15.78                11.8                  9.96
11               14.3                 10.7                  11
12               13.1                 9.82                  12
13               12.1                 9.1                   13
14               11.3                 8.47                  13.92
15               10.5                 7.87                  14.98
16               9.9                  7.42                  15.89
Mersenne-Twister
• Run time with Mersenne-Twister (without optimization): 5 sec
• Run time with the Cell/B.E. SDK: 4.1 sec
Mechanisms to improve the performance still further:
• Optimize Mersenne-Twister code for threading framework.
• Rewrite the code to utilize the SIMD capabilities of SPUs.
Performance comparison between Cell/B.E. SDK and Mersenne-Twister random-number generators
Runtimes are in seconds.
Precision   SDK RNG (2.4 GHz)   Mersenne-Twister RNG (2.4 GHz)   Mersenne-Twister RNG (3.2 GHz, estimated)
Single      4.1                 1.02                             0.76
Double      9.9                 2.47                             1.85
Mixed-precision workloads
Mixed precision: only those parts that actually need double precision are calculated using double precision.
Disadvantage: makes for a slight increase in the programming effort needed.
• Identify parts of the code which use this sort of precision.
• Make the appropriate changes to the code.
Advantage: performance improvement.
Mixed-precision workloads
The two methods of applying mixed precision to our code are:
(1) Concatenating two single-precision random variables.
(2) Generating one single-precision random variable and then doing a double-precision division.
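The two methods can be sketched roughly as below. The slides give no implementation, so everything here is an assumption: the function names are hypothetical, and "single-precision random variable" is read as one 32-bit draw from the generator. The bit layout in the first function follows the common 53-bit construction used with Mersenne Twister, which may differ from what the authors did.

```c
#include <assert.h>
#include <stdint.h>

/* Method (1), as assumed here: concatenate bits from two 32-bit draws into
 * one 53-bit value, then scale it to [0, 1). */
double uniform_from_two_draws(uint32_t first, uint32_t second)
{
    uint64_t high = first >> 5;             /* keep the top 27 bits */
    uint64_t low = second >> 6;             /* keep the top 26 bits */
    uint64_t bits = (high << 26) | low;     /* 53 bits in total */
    assert(bits < (UINT64_C(1) << 53));
    return (double)bits / 9007199254740992.0;   /* divide by 2^53 */
}

/* Method (2), as assumed here: one 32-bit draw followed by a single
 * double-precision division, giving 32 random bits in [0, 1). */
double uniform_from_one_draw(uint32_t draw)
{
    return (double)draw / 4294967296.0;         /* divide by 2^32 */
}
```

Under this reading, the first method costs two draws per double-precision value but fills the whole 53-bit significand. The second costs one draw and leaves the low bits of the result zero.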
Mixed-precision workloads
# SPU CC_DP_MT CC_DP_SDK M_DP_MT SP_MT SP_SDK
1 40.33 40.33 45.76 12.01 11.16
2 20.33 20.33 22.88 6.06 5.70
3 13.56 13.56 15.26 4.05 3.80
4 10.17 10.17 11.44 3.04 2.85
5 8.13 8.13 9.16 2.43 2.29
6 6.78 6.78 7.64 2.03 1.91
7 5.82 5.82 6.55 1.75 1.64
8 5.09 5.09 5.75 1.53 1.44
9 4.53 4.52 5.11 1.36 1.28
10 4.08 4.08 4.60 1.22 1.15
11 3.70 3.70 4.18 1.11 1.05
12 3.40 3.39 3.84 1.02 0.96
13 3.14 3.14 3.54 0.94 0.89
14 2.92 2.92 3.29 0.88 0.83
15 2.72 2.72 3.07 0.82 0.78
16 2.52 2.53 2.88 0.77 0.73
• CC_DP_MT = Concatenation Double-Precision Mersenne-Twister
• CC_DP_SDK = Concatenation Double-Precision SDK
• M_DP_MT = Division Double-Precision Mersenne-Twister
• SP_MT = Single-Precision Mersenne-Twister
• SP_SDK = Single-Precision SDK
Mixed-precision workloads
Additional optimization techniques :
• Unrolling more parts of Mersenne-Twister RNG.
• Additional software pipelining by parallelizing computation.
• Introducing new variables to eliminate dependencies.
• Pre-calculating some items:
    a[0] = <something>;
    for (i = 0; i < N; i++) {
        sinf4(a[0]);
        sinf4(a[i+1]);
        ......
    }
Intel optimizations
• A “master” thread forks “slave” threads to perform RNG.
• The “master” thread corresponds to the part of the Cell/B.E. code that runs on the PPU.
• The “slave” threads correspond to the parts that run on the SPUs.
Difference:
• Work scheduled by the OpenMP runtime shares the same cores as the OS threads.
• The SPUs on the Cell/B.E. version are not running the operating system. This enables them to be used entirely to run the application code.
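The fork-and-join pattern described above might look like the following OpenMP sketch. It is not the authors' Intel code: the function name, the per-stream seeding, and the LCG standing in for the real generator are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in uniform generator (64-bit LCG); an assumption for illustration. */
static double uniform01(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0;   /* top 53 bits / 2^53 */
}

/* The calling ("master") thread forks a team of threads; each loop iteration
 * is one independent random stream handled by a worker thread. Returns the
 * mean of all draws, which should be near 0.5 for a uniform generator. */
double mean_of_uniform_draws(int n_streams, long draws_per_stream)
{
    double total = 0.0;
    assert(n_streams > 0 && draws_per_stream > 0);

    #pragma omp parallel for reduction(+:total)
    for (int s = 0; s < n_streams; s++) {
        /* Naive per-stream seed; production code would need a sounder scheme. */
        uint64_t state = 0x9E3779B97F4A7C15ULL * (uint64_t)(s + 1);
        double local = 0.0;
        for (long i = 0; i < draws_per_stream; i++) {
            local += uniform01(&state);
        }
        total += local;   /* combined across threads by the reduction clause */
    }
    return total / ((double)n_streams * (double)draws_per_stream);
}
```

Built with an OpenMP-enabled compiler (for example, gcc with -fopenmp), the loop iterations are shared among threads that run on the same cores as the operating system. Without OpenMP the pragma is ignored and the loop runs serially.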
Intel optimizations
The last four columns give results by number of threads (cores).
System / CPU speed (GHz)   Operating system   Compiler     1       2       4       8
x3550 / 3.0                Red Hat Linux      Intel ICPC   31.76   15.9    8.46    -
x336 / 2.8                 Red Hat Linux      Intel ICPC   43.27   30.02   22.62   -
HS21 / 2.33                Fedora Core 6      gcc          43.38   21.74   10.88   8.26
Tying it all together
Future Work
Results achieved so far are on a system that many view as being unsuitable for Financial Markets users.
• “Enhanced Double-Precision” version of the Cell Broadband Engine technology.
• Systems based on Cell/B.E. technology are an excellent platform for Financial Markets applications.
Getting the most performance out of Cell/B.E. technology
Offload as much of the computation onto the SPUs as possible.
Write the SIMD code yourself rather than relying on the compiler to do it.
• XLC provides “auto-SIMDize”.
• This may not be a good approximation of hand-written SIMD code.
In certain situations, you might find that starting from scratch is a much quicker way to implement application code.
Conclusions
Reasons general-purpose processors make up the majority of computational infrastructures:
(1) Huge numbers of systems based on these processors.
(2) Large supply of skilled professionals, which leads to lower skills costs.
(3) A lot of application development tooling.
(4) The relatively “easy” code porting to these platforms.
Conclusions
“ESOTERIC” technologies:
• Offer high performance for their chip area.
• Consume much less power per computation.
Disadvantages:
(1) Skills to program them are rare and, hence, expensive.
(2) Lack of application development tooling.
(3) The “porting” process is generally both slow and costly.
Conclusions
Advantages of Cell/B.E. technology:
(1) Consumes less power, space and cooling.
(2) High computational power.
(3) Better data movement and manipulation abilities.
(4) A number of strong customer proof points.
(5) Support from key Independent Software Vendors.
(6) Results of experiments such as this one.
Questions….
Comments….
Caveats ….