epanalytics


Upload: dell-enterprise

Post on 20-Aug-2015


ARM in HPC

Laura Carrington, PhD

VP of Research, EP Analytics, Inc.

Background

High Performance Computing (HPC)

Unique system architecture
– Large-scale: up to hundreds of thousands of cores
– Requires specialized interconnect: tightly-coupled computations between cores communicate on msec frequency
– Hardware accelerators combined with host processor (e.g. Intel Xeon Phi, NVIDIA GPU, Convey FPGAs)

Important application domains, with a large span of resource requirements
– Big data analytics
– Engineering/CAD
– Bioinformatics/genomics/pharmaceuticals
– Climate/weather
– Physical/chemical sciences
– Aerospace and automotive simulations

HPC Software Characteristics

Performance dependent on different levels of parallelism
– Multi-processor/Multi-core
– Thread level parallelism
– Data parallelism
– Instruction level parallelism

Tasks/threads are tightly synchronized with each other

Large requirement for efficient data movement (e.g. large memory bandwidth, high speed intra-node and inter-node communication)

Workloads typically not suitable for cloud-based platforms

Data movement causes inefficient use of processors (e.g. ~7% peak performance of system)

Batch scheduled on dedicated nodes

Hardware & Software Challenges in HPC

Running calculations on a larger number of cores is the main challenge: no system is big enough
– Computational scientists want to add more science details to calculations, resulting in larger problems
– Existing scientific calculations demand higher resolution to provide more detail (e.g. for design, weather, etc.)

Key constraints in designing larger HPC systems
– Investment in software is large: decades of application development in “legacy” applications (similar for ISV)
– Solutions need to exploit different types of parallelism with minimal investment in development
• Heterogeneous architectures with accelerators require large investment in development without performance guarantees
– Power constraint: operational cost dominates the total cost of the purchase

Demands for better energy efficiency

How is HPC addressing demands for energy-efficient computing?

Two issues: power constraint and energy efficiency (e.g. limit & cost of operations)

Potential solutions:
– Wait around for traditional processors to improve energy efficiency
– Use hardware accelerators
– Use smaller/less capable cores designed for low power draw
– Some combination of the above

Where does ARM fit in?

The Big Idea – Many ARM Cores

Why many simple cores instead of a few brawny cores?
– Energy efficiency: more computations for a given power draw (FLOPS/Watt)
– Cost efficiency: more computations for less money (FLOPS/$)
– Physical footprint: more computational capability for a given machine room space (FLOPS/m²)

Why ARM in particular?
– Popular in mobile/embedded markets: commodity volume lowers price
– Engineered for energy efficiency & low power draw
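The three efficiency metrics above are plain ratios, and a small sketch can make the comparison concrete. The two node designs and every number below are hypothetical placeholders chosen for easy arithmetic; none of them come from the talk or from measurements.

```python
from dataclasses import dataclass


@dataclass
class NodeDesign:
    """Hypothetical node description; all values are illustrative only."""
    name: str
    flops: float     # sustained floating point operations per second
    watts: float     # power draw in watts
    dollars: float   # purchase cost in dollars
    square_m: float  # machine room floor space in square metres


def efficiency(node: NodeDesign) -> dict:
    """Return the three ratios named on the slide for one node design."""
    return {
        "flops_per_watt": node.flops / node.watts,
        "flops_per_dollar": node.flops / node.dollars,
        "flops_per_m2": node.flops / node.square_m,
    }


# Made-up design points: the simple-core node is slower in absolute terms
# but draws far less power, costs less, and takes less space.
brawny = NodeDesign("few brawny cores", flops=200e9, watts=200.0,
                    dollars=4000.0, square_m=0.10)
simple = NodeDesign("many simple cores", flops=100e9, watts=50.0,
                    dollars=1000.0, square_m=0.04)

for node in (brawny, simple):
    print(node.name, efficiency(node))
```

With these placeholder numbers the simple-core design delivers half the absolute FLOPS but twice the FLOPS/Watt, which is the trade-off the slide argues for.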

Implications of Many ARM Designs

Memory subsystem (more of) a bottleneck
– Total memory BW is limited: lower memory BW per core
– Smaller caches (probably)

Different types of parallelism needed
– Shorter SIMD vectors/vector instructions
– Simpler HW: less logic to extract instruction level parallelism
– Throughput needs to come from thread level parallelism

Pressure on interconnect
– More threads: more information passed between threads across the network

Practical Limitations of ARMv7 for HPC

ARMv7 (32-bit):

32-bit: limitation of 4GB address space per process
– Becomes an issue for shared memory thread based programming models

Calculations in double precision take huge performance penalty
– IEEE-754 floating point double precision not supported
– No double precision SIMD/vector instructions
– A hurdle to software portability

ARMv8 to the rescue?
– 64-bit: address space issue resolved
– Full IEEE-754 compliance: no performance penalty for double precision
– Double precision SIMD/vector instructions
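The 4GB limit follows directly from the pointer width: 32 bits can address 2^32 bytes. A minimal sketch of that arithmetic, and of why it bites shared-memory threaded codes, is below. The helper names and the 16-thread example are illustrative assumptions, and the per-thread figure is an upper bound that ignores space taken by code, stacks, and the operating system.

```python
def address_space_bytes(pointer_bits: int) -> int:
    """Bytes addressable by a flat pointer of the given width."""
    return 2 ** pointer_bits


def max_doubles_per_thread(pointer_bits: int, threads: int) -> int:
    """Upper bound on 8-byte doubles each thread can own when all threads
    share one process address space (threaded shared-memory model)."""
    return address_space_bytes(pointer_bits) // threads // 8


gib = 1024 ** 3
print(address_space_bytes(32) // gib, "GiB per 32-bit process")
print(max_doubles_per_thread(32, threads=16), "doubles per thread at 16 threads")
print(address_space_bytes(64) // gib, "GiB per 64-bit process (architectural limit)")
```

Because every thread in a process shares the same address space, adding threads to a 32-bit process shrinks the data each thread can hold, whereas separate processes would each get their own 4GB.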

EP Analytics’ Approach

Will ARM work for HPC?

What does it need to do to win:

Improve energy efficiency with minimal performance impact

Minimal development cost to move applications onto system

How can we know if it will work?

Need detailed analysis of the hardware and software limitations of the system before investing

Understand the computations/workloads that could benefit from ARM’s energy efficiency

EP Analytics’ Approach: workload-hardware analysis for clients

EP Analytics’ methodology is to identify the best hardware for a given workload.

Utilize in-house specialized tools to perform detailed power & performance characterization of the software/applications and the hardware
– Tools optimized for these tasks work efficiently on large scale systems and applications
– Determine computations within applications that are not affected by (or are sensitive to) the limitations of the hardware
+ Identifies sections of source code that map well to hardware & why
+ Allows efficient porting & optimization strategies for new hardware

Company’s researchers have more than a decade of experience in mapping applications to optimal hardware for clients
– Inform procurement decisions made by DoD
– Develop a set of best practices to port legacy applications to emerging architectures
– Identify optimization strategies for a given architecture
– Aid in system acceptance and hardware debugging

EP Analytics’ Approach

Convolution/Modeling Method: map Application Signature to Machine Profile

Machine Profile: system behavior in the presence of some set of fundamental operations (power, performance)

Application Signature: detailed summaries of the fundamental operations to be carried out by the application

EP Analytics developed tools & techniques
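One way to picture the mapping of an application signature onto a machine profile is as a weighted sum over fundamental operations. The sketch below is only an illustration of that idea under strong assumptions, not EP Analytics' actual model: the operation names, rates, and per-operation energies are invented, and the simple sum assumes that operation types do not overlap in time.

```python
from typing import Dict, NamedTuple, Tuple


class OpCost(NamedTuple):
    """Machine profile entry for one fundamental operation (hypothetical)."""
    rate: float    # operations the machine sustains per second
    joules: float  # energy consumed per operation


def convolve(signature: Dict[str, float],
             profile: Dict[str, OpCost]) -> Tuple[float, float]:
    """Combine operation counts (signature) with per-operation costs
    (profile) into a predicted time in seconds and energy in joules."""
    time_s = 0.0
    energy_j = 0.0
    for op, count in signature.items():
        cost = profile[op]
        time_s += count / cost.rate
        energy_j += count * cost.joules
    return time_s, energy_j


# Invented example: counts of operations an application performs ...
signature = {"fp_op": 4.0e9, "mem_access": 1.0e9}
# ... and what each operation costs on some machine.
profile = {
    "fp_op": OpCost(rate=2.0e9, joules=1.0e-9),
    "mem_access": OpCost(rate=0.5e9, joules=4.0e-9),
}

time_s, energy_j = convolve(signature, profile)
print(f"predicted time: {time_s:.2f} s, predicted energy: {energy_j:.2f} J")
```

Swapping in a different profile while keeping the same signature gives a prediction for the same application on different hardware, which is the appeal of separating the two.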

EP Analytics’ Approach

Analysis process utilizing detailed characterization of hardware & software (power & performance models)
– Investigate what set of properties is needed to explain performance and energy variability

Develop specialized tools to capture properties of the computation/software’s requirements, used as model input
– Tools based on binary instrumentation tied to source code
– Static analysis on x86 binaries (PEBIL) and ARM binaries (EPAX)
• Floating point operations, memory accesses, working set sizes

Both tools were developed by researchers from EP Analytics, Inc.

Result of analysis (e.g. model output): performance of computations and energy required to complete the computations

ARM core analysis determines the performance and energy of a range of scientific computations on an ARM core relative to Intel Sandy Bridge

Analysis of ARMv7 Energy-efficiency for HPC computations

Goal: Predict the energy behavior of a series of computational patterns (understand the energy profile of applications built from these patterns)

Develop a model that takes the computational characterization as input and outputs the energy required for the computation

Models capture all interesting trends in energy variation
– Model uses only static metrics collected via the EPAX tool

[Figure: Energy of ARM relative to Intel Sandy Bridge]


These computations use ~5-10 times less energy on the Cortex-A15 than on Sandy Bridge, and the models capture this behavior.


Energy efficiency is better when running in single precision (SP) because of the performance hit when running in double precision (DP)

Looking Ahead to ARMv8

Can it work for HPC?

ARMv8 Hardware initial analysis

ARMv8 hardware recently became available
– Early access provided by Dell on 64-bit ARM SOC processor

[Figure: Design point summary (not to scale). Labels: Cortex-A9, Cortex-A15, 64-bit ARM SOC (ARM v8), Sandy Bridge. Axis: increasing performance & power consumption]

Performance/Energy Analysis for HPC computations

Run a suite of computations with increasing memory footprint (e.g. L1, L2, L3, MM)
– Across multiple ARM generations. Results normalized to Intel Sandy Bridge

More details in: Laurenzano, M., Tiwari, A., Jundt, A., Peraza, J., Ward, W., Campbell, R., and Carrington, L. Characterizing the Performance-Energy Tradeoff of Small ARM Cores in HPC Computation. Euro-Par 2014.

[Figure: Performance impact of ARM core, 64-bit ARM SOC performance. Panels: Double Precision, Single Precision. Annotations: ~20X slowdown, ~2X slowdown]

Each generation of ARM comes closer to Sandy Bridge in performance
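A footprint sweep of this kind can be planned by choosing working set sizes that land in each level of the memory hierarchy. The sketch below only classifies footprints; it does not run or time any kernel. The cache capacities are assumed example values, not those of any processor in the talk, and the function names are hypothetical.

```python
# Assumed cache capacities in bytes, smallest first; real values depend on
# the processor being studied.
ASSUMED_CACHES = [
    ("L1", 32 * 1024),
    ("L2", 256 * 1024),
    ("L3", 8 * 1024 * 1024),
]


def memory_level(footprint_bytes, caches=ASSUMED_CACHES):
    """Name the smallest level whose capacity holds the footprint,
    or "MM" (main memory) if no cache is large enough."""
    for name, capacity in caches:
        if footprint_bytes <= capacity:
            return name
    return "MM"


def footprint_sweep(start_bytes, stop_bytes):
    """Yield footprints doubling from start_bytes up to stop_bytes inclusive."""
    size = start_bytes
    while size <= stop_bytes:
        yield size
        size *= 2


# Sweep from 4 KiB to 64 MiB and show which level each size targets.
for size in footprint_sweep(4 * 1024, 64 * 1024 * 1024):
    print(f"{size // 1024:>8} KiB -> {memory_level(size)}")
```

Doubling the footprint at each step gives several sample points inside every level, so a change in behavior at a cache boundary shows up as a visible step in the results.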

Performance/Energy Analysis for HPC computations

Run a suite of computations with increasing memory footprint (e.g. L1, L2, L3, MM)
– Across multiple ARM versions. Results normalized to Sandy Bridge

[Figure: Energy savings of ARM core, 64-bit ARM SOC energy. Panels: Double Precision, Single Precision]

Each generation of ARM achieves greater energy savings. Even the Cortex-A9 achieves energy savings for some computations.

Type of computation and memory requirement dictate energy savings potential.
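The link between slowdown and energy savings is the relation energy = average power × run time. A minimal sketch of the normalized form is below; the power ratio used in the example is a made-up value for illustration, not a measurement from the talk.

```python
def relative_energy(slowdown: float, power_ratio: float) -> float:
    """Energy of the ARM run divided by energy of the reference run.

    slowdown    = ARM run time / reference run time
    power_ratio = ARM average power / reference average power
    Since energy = power * time, the two ratios simply multiply.
    """
    return slowdown * power_ratio


def break_even_power_ratio(slowdown: float) -> float:
    """Largest power ratio at which the slower core still uses no more
    energy than the reference."""
    return 1.0 / slowdown


# Hypothetical case: a core that is 2x slower but draws one tenth the power
# uses one fifth of the energy, i.e. 5x less.
ratio = relative_energy(slowdown=2.0, power_ratio=0.1)
print(f"relative energy: {ratio:.2f} ({1 / ratio:.1f}x less energy)")
print(f"break-even power ratio at 2x slowdown: {break_even_power_ratio(2.0):.2f}")
```

This is why a large slowdown can erase the benefit of a low-power core: at a 20x slowdown the core must draw less than one twentieth of the reference power just to break even.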

Conclusions: ARM potential for HPC

ARM on HPC applications
– Performance and energy vary significantly and depend on the calculation
– Variability has a lot to do with FP/SIMD and cache/memory features
– EP Analytics’ methodology utilizes these features to accurately pinpoint which calculations benefit from ARM hardware and by how much

Looking ahead to ARMv8
– Significant energy improvements over Sandy Bridge
– 64-bit ARM SOC shows significant performance improvements over Cortex-A15 with similar energy profiles
– Ongoing work looks at scaling the HPC applications to multiple nodes

More detailed data: http://epanalytics.com/data/euro-par2014

www.epanalytics.com

Thank you!