epanalytics
High Performance Computing (HPC)
Unique system architecture
– Large-scale - up to hundreds of thousands of cores
– Requires specialized interconnect - tightly-coupled computations between cores communicate on msec frequency
– Hardware accelerators combined with host processor (e.g. Intel Xeon Phi, NVIDIA GPU, Convey FPGAs)
Important application domains – large span of resource requirements
– Big data analytics
– Engineering/CAD
– Bioinformatics/genomics/pharmaceuticals
– Climate/weather
– Physical/chemical sciences
– Aerospace and automotive simulations
HPC Software Characteristics
Performance dependent on different levels of parallelism
– Multi-processor/Multi-core
– Thread level parallelism
– Data parallelism
– Instruction level parallelism
Tasks/threads are tightly synchronized with each other
Large requirement for efficient data movement (e.g. large memory bandwidth, high speed intra-node and inter-node communication)
Workloads typically not suitable for cloud-based platforms
Data movement causes inefficient use of processors (e.g. ~7% peak performance of system)
Batch scheduled on dedicated nodes
Hardware & Software Challenges in HPC
Running calculations on larger number of cores is main challenge – no system is big enough
– Computational scientists want to add more science details to calculations, resulting in larger problems
– Existing scientific calculations demand higher resolution – provide more detail (e.g. for design, weather, etc.)
Key constraints in designing larger HPC systems
– Investment in software large – decades of application development in “legacy” applications (similar for ISV)
– Solutions need to exploit different types of parallelism with minimal investment in development.
• Heterogeneous architectures with accelerators require large investment in development without performance guarantees
– Power constraint – operational cost dominating overall cost of total purchase
Demands for better energy efficiency
How is HPC addressing demands for energy-efficient computing?
Two issues: power constraint and energy efficiency (e.g. limit & cost of operations)
Potential solutions:
– Wait around for traditional processors to improve energy-efficiency
– Use hardware accelerators
– Use smaller/less capable cores designed for low power draw
– Some combination of the above
Where does ARM fit in?
The Big Idea – Many ARM Cores
Why many simple cores instead of a few brawny cores?
– Energy efficiency – more computations for a given power draw (FLOPS/Watt)
– Cost efficiency – more computations for less money (FLOPS/$)
– Physical footprint – more computational capability for a given machine room space (FLOPS/m²)
Why ARM in particular?
– Popular in mobile/embedded markets – commodity lowers price
– Engineered for energy efficiency & low power draw
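The three figures of merit named above (FLOPS/Watt, FLOPS/$, FLOPS/m²) are simple ratios. The sketch below shows the arithmetic only. The `SystemSpec` type, the `efficiency_metrics` helper, and every number in it are hypothetical and chosen for illustration; none of them come from the slides or from measurements.

```python
from dataclasses import dataclass


@dataclass
class SystemSpec:
    """Hypothetical system description; all fields are illustrative."""
    name: str
    peak_flops: float      # floating point operations per second
    power_watts: float     # power draw under load
    cost_dollars: float    # purchase cost
    floor_area_m2: float   # machine room footprint


def efficiency_metrics(spec: SystemSpec) -> dict:
    """Return the three ratios: FLOPS/Watt, FLOPS/$, FLOPS/m^2."""
    return {
        "flops_per_watt": spec.peak_flops / spec.power_watts,
        "flops_per_dollar": spec.peak_flops / spec.cost_dollars,
        "flops_per_m2": spec.peak_flops / spec.floor_area_m2,
    }


# Made-up numbers, chosen only to exercise the arithmetic.
brawny = SystemSpec("few-brawny-cores", 2.0e12, 400.0, 10000.0, 1.0)
simple = SystemSpec("many-simple-cores", 1.0e12, 100.0, 4000.0, 0.5)

for spec in (brawny, simple):
    print(spec.name, efficiency_metrics(spec))
```

With these invented inputs the many-simple-cores system has lower peak FLOPS but a higher FLOPS/Watt, which is the kind of trade-off the slide argues for. Whether a real system behaves this way depends on measured values.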
Implications of Many ARM Designs
Memory subsystem (more of) a bottleneck
– Total memory BW is limited – lower memory BW per core
– Smaller caches (probably)
Different types of parallelism needed
– Shorter SIMD vectors/vector instructions
– Simpler HW – less logic to extract instruction level parallelism
– Throughput needs to come from thread level parallelism
Pressure on interconnect
– More threads – more information passed between threads across the network
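The memory bottleneck point can be made concrete with a short calculation: if total memory bandwidth is fixed, the share available to each core shrinks as the core count grows. The sketch below assumes bandwidth is divided evenly among cores, which is a simplification. The function names and the 50 GB/s figure are hypothetical.

```python
def bandwidth_per_core(total_bw_gbs: float, cores: int) -> float:
    """Even share of total memory bandwidth per core, in GB/s."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_bw_gbs / cores


def bytes_per_flop(total_bw_gbs: float, cores: int, gflops_per_core: float) -> float:
    """Bytes of memory traffic each core can sustain per floating point operation."""
    if gflops_per_core <= 0:
        raise ValueError("gflops_per_core must be positive")
    return bandwidth_per_core(total_bw_gbs, cores) / gflops_per_core


# Hypothetical node: 50 GB/s of total memory bandwidth, varying core count.
TOTAL_BW_GBS = 50.0
for cores in (8, 16, 64):
    print(cores, "cores:", bandwidth_per_core(TOTAL_BW_GBS, cores), "GB/s per core")
```

A computation whose bytes-per-FLOP requirement exceeds what `bytes_per_flop` reports for a design would be limited by memory rather than by the cores themselves.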
Practical Limitations of ARMv7 for HPC
ARMv7 (32-bit):
32-bit – limitation of 4GB address space per process
– Becomes an issue for shared memory thread based programming models
Calculations in double precision take huge performance penalty
– IEEE-754 floating point double precision not supported
– No double precision SIMD/vector instructions
– A hurdle to software portability
ARMv8 to the rescue?
– 64-bit – address space issue resolved
– Full IEEE-754 compliance – no performance penalty for double precision
– Double precision SIMD/vector instructions
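The 4GB figure follows from 32-bit addressing: 2^32 bytes. The sketch below checks whether an array of double precision values could fit in one process's address space. It is an upper bound only, since in practice the operating system reserves part of the range and the usable space is smaller. The helper name and the grid sizes are hypothetical.

```python
BYTES_PER_DOUBLE = 8
ADDRESSABLE_BYTES_32 = 2 ** 32  # 4 GiB with 32-bit addresses


def fits_in_address_space(n_doubles: int, address_bits: int = 32) -> bool:
    """True if n_doubles double precision values fit below 2**address_bits bytes.

    Optimistic bound: ignores code, stack, and kernel-reserved ranges.
    """
    return n_doubles * BYTES_PER_DOUBLE < 2 ** address_bits


# Hypothetical 3D grids held by a single shared memory process.
small_grid = 512 ** 3    # 1 GiB of doubles
large_grid = 1024 ** 3   # 8 GiB of doubles

print("512^3 fits in 32-bit:", fits_in_address_space(small_grid))
print("1024^3 fits in 32-bit:", fits_in_address_space(large_grid))
print("1024^3 fits in 64-bit:", fits_in_address_space(large_grid, address_bits=64))
```

This is why the limit matters most for shared memory thread based models: all threads of a process share one address space, so the whole data set has to fit under the limit.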
Will ARM work for HPC?
What does it need to do to win:
Improve energy efficiency with minimal performance impact
Minimal development cost to move applications onto system
How can we know if it will work?
Need detailed analysis of the hardware and software limitations of the system before investing
Understand the computations/workloads that could benefit from ARM’s energy efficiency
EP Analytics’ Approach: workload-hardware analysis for clients
EP Analytics’ methodology is to identify the best hardware for a given workload.
Utilize in-house specialized tools to perform detailed power & performance characterization of the software/applications and the hardware
– Tools are optimized for these tasks, which lets them work efficiently on large scale systems and applications
– Determine computations within applications that are not affected by (or are sensitive to) the limitations of the hardware
+ Identifies sections of source code that map well to hardware & why
+ Allows efficient porting & optimization strategies for new hardware
Company’s researchers have more than a decade of experience in mapping applications to optimal hardware for clients
– Inform procurement decisions made by DoD
– Develop a set of best practices to port legacy applications to emerging architectures
– Identify optimization strategies for a given architecture
– Aid in system acceptance and hardware debugging
EP Analytics’ Approach
Convolution/Modeling Method: Map Application Signature to Machine Profile
Machine Profile: System behavior in the presence of some set of fundamental operations – power, performance
EP Analytics developed tools & techniques
Application Signature: Detailed summaries of the fundamental operations to be carried out by the application
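One way to picture mapping an application signature onto a machine profile is a per-operation sum: each operation count from the signature is combined with the rate and energy cost the profile reports for that operation. The sketch below is a deliberately simplified assumption (no overlap between operation types, fixed cost per operation). The slides do not give the actual model, and the operation names and numbers here are hypothetical.

```python
def convolve(signature: dict, profile: dict) -> dict:
    """Combine operation counts with per-operation machine costs.

    signature: operation name -> number of operations in the application
    profile:   operation name -> {"ops_per_sec": ..., "joules_per_op": ...}
    Assumes operations do not overlap in time (a simplification).
    """
    seconds = 0.0
    joules = 0.0
    for op, count in signature.items():
        if op not in profile:
            raise KeyError(f"machine profile has no entry for {op!r}")
        seconds += count / profile[op]["ops_per_sec"]
        joules += count * profile[op]["joules_per_op"]
    return {"seconds": seconds, "joules": joules}


# Hypothetical application signature (operation counts).
signature = {"fp_op": 4e9, "l1_access": 2e9, "mem_access": 1e8}

# Hypothetical machine profile (rates and energy per operation).
profile = {
    "fp_op": {"ops_per_sec": 2e9, "joules_per_op": 1e-9},
    "l1_access": {"ops_per_sec": 4e9, "joules_per_op": 5e-10},
    "mem_access": {"ops_per_sec": 1e8, "joules_per_op": 2e-8},
}

print(convolve(signature, profile))
```

Running the same signature against two different profiles would give a relative comparison of two machines for that application, which is the kind of question the method is described as answering.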
EP Analytics’ Approach
Analysis process utilizing detailed characterization of hardware & software (power & performance models)
– Investigate what set of properties is needed to explain performance and energy variability
Develop specialized tools to capture properties of the computation/software’s requirements used as model input
– Tools based on binary instrumentation tied to source code
– Static analysis on x86 binaries (PEBIL) and ARM binaries (EPAX)
• Floating point operations, memory accesses, working set sizes
Both tools developed by the researchers from EP Analytics, Inc.
Result of analysis (e.g. model output) – performance of computations and energy required to complete the computations
ARM core analysis determines the performance and energy of a range of scientific computations on ARM core relative to Intel Sandy Bridge
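To give a feel for the kind of static metric mentioned above (counts of floating point operations and memory accesses), the toy sketch below classifies a list of ARM instruction mnemonics without running any code. It is not how PEBIL or EPAX work internally, which the slides do not describe. The prefix tables, the `classify` rule, and the sample block are illustrative assumptions.

```python
from collections import Counter

# Illustrative prefix tables; a real tool would decode instructions properly.
MEM_PREFIXES = ("ldr", "str", "ldm", "stm", "vldr", "vstr")
FP_PREFIXES = ("vadd", "vsub", "vmul", "vdiv", "vmla")


def classify(mnemonic: str) -> str:
    """Place one mnemonic into a coarse category."""
    parts = mnemonic.lower().split(".")
    base, suffixes = parts[0], parts[1:]
    if base.startswith(MEM_PREFIXES):
        return "mem_access"
    # Count as floating point only when a type suffix such as .f32/.f64 is present.
    if base.startswith(FP_PREFIXES) and any(s.startswith("f") for s in suffixes):
        return "fp_op"
    return "other"


def static_counts(mnemonics: list) -> Counter:
    """Count instructions per category for one block of code."""
    return Counter(classify(m) for m in mnemonics)


# Hypothetical basic block from a double precision multiply-add loop.
block = ["vldr", "vldr", "vmul.f64", "vadd.f64", "vstr", "add", "cmp", "bne"]
print(static_counts(block))
```

Counts like these, gathered per block and tied back to source lines, are the sort of input a performance or energy model could consume.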
Analysis of ARMv7 Energy-efficiency for HPC computations
Goal: Predict the energy behavior for a series of computational patterns (understand energy profile of applications built from these patterns)
Develop a model that takes the computational characterization as input and outputs the energy required for the computation
Models capture all interesting trends in energy variation
– Model uses only static metrics collected via EPAX tool
Energy of ARM relative to Intel Sandy Bridge
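The slides do not state what form the energy model takes. As one simple possibility, a model that maps a static metric to energy can be fitted with ordinary least squares. The sketch below fits a straight line to synthetic points. The metric name, the data, and the choice of a linear model are all assumptions made for illustration, not results from the study.

```python
def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_energy(metric: float, slope: float, intercept: float) -> float:
    """Evaluate the fitted model for a new computation."""
    return slope * metric + intercept


# Synthetic training data (not measurements): a static metric per computation
# (memory operations per floating point operation) and the energy in joules.
mem_ops_per_flop = [0.1, 0.2, 0.4, 0.8]
energy_joules = [2.5, 3.0, 4.0, 6.0]

slope, intercept = fit_line(mem_ops_per_flop, energy_joules)
print("slope:", slope, "intercept:", intercept)
print("predicted energy at 0.5:", predict_energy(0.5, slope, intercept))
```

The point of the sketch is the workflow: static metrics in, energy estimate out, with no need to run the computation on the target hardware once the model is fitted.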
These computations use ~5-10 times less energy on Cortex-A15 (A15) than on Sandy Bridge (SB)! Models capture this behavior.
Energy efficiency is better in single precision (SP) because of the performance hit when running in double precision (DP)
ARMv8 Hardware initial analysis
ARMv8 hardware recently became available
– Early access provided by Dell on 64-bit ARM SOC processor
[Figure: Design point summary (not to scale). Axis: increasing performance & power consumption. Labeled design points: Cortex-A9, Cortex-A15, 64-bit ARM SOC (ARMv8), Sandy Bridge.]
Performance/Energy Analysis for HPC computations
Run a suite of computations with increasing memory footprint (e.g. L1, L2, L3, MM)
– Across multiple ARM generations. Results normalized to Intel Sandy Bridge
More details in: Characterizing the Performance-Energy Tradeoff of Small ARM Cores in HPC Computation. Laurenzano, M., Tiwari, A., Jundt, A., Peraza, J., Ward, W., Campbell, R., and Carrington, L. EuroPar 2014.
[Figure: 64-bit ARM SOC Performance. Performance impact of ARM core. Panels: Double Precision, Single Precision. Annotations: ~20X slowdown, ~2X slowdown.]
Each generation of ARM comes closer to Sandy Bridge in performance
Performance/Energy Analysis for HPC computations
Run a suite of computations with increasing memory footprint (e.g. L1, L2, L3, MM)
– Across multiple ARM versions. Results normalized to Sandy Bridge
[Figure: 64-bit ARM SOC Energy. Energy savings of ARM core. Panels: Double Precision, Single Precision.]
Each generation of ARM achieves greater energy savings. Even A9 achieves energy savings for some computations.
Type of computation and memory requirement dictate energy savings potential.
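Normalizing results to Sandy Bridge, as these two slides describe, amounts to dividing each ARM measurement by the matching Sandy Bridge measurement. The sketch below shows that arithmetic, with energy taken as average power multiplied by runtime. The function names and all numbers are hypothetical and are not the study's data.

```python
def energy_joules(avg_power_watts: float, seconds: float) -> float:
    """Energy consumed by a run: average power multiplied by runtime."""
    return avg_power_watts * seconds


def normalize(arm: dict, baseline: dict) -> dict:
    """Express an ARM run relative to a baseline run of the same computation.

    Each run is a dict with "seconds" and "joules".
    slowdown > 1 means ARM is slower; energy_ratio < 1 means ARM uses less energy.
    """
    slowdown = arm["seconds"] / baseline["seconds"]
    energy_ratio = arm["joules"] / baseline["joules"]
    return {
        "slowdown": slowdown,
        "energy_ratio": energy_ratio,
        "energy_savings": 1.0 - energy_ratio,
    }


# Made-up runs of one computation: the baseline is fast but draws more power.
baseline_run = {"seconds": 1.0, "joules": energy_joules(40.0, 1.0)}
arm_run = {"seconds": 2.0, "joules": energy_joules(4.0, 2.0)}

print(normalize(arm_run, baseline_run))
```

The example shows how a core can be slower and still save energy: the lower power draw has to outweigh the longer runtime, which is why the outcome depends on the type of computation and its memory footprint.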
Conclusions: ARM potential for HPC
ARM on HPC applications
– Performance and energy vary significantly and depend on the calculation.
– Variability has a lot to do with FP/SIMD and cache/memory features
– EP Analytics’ methodology utilizes these features to accurately pinpoint which calculations benefit from ARM hardware and by how much
Looking ahead to ARMv8
– Significant energy improvements over Sandy Bridge
– 64-bit ARM SOC shows significant performance improvements over Cortex-A15 with similar energy profiles
– Ongoing work looks at scaling the HPC applications to multiple nodes
More detailed data: http://epanalytics.com/data/euro-par2014