
Page 1:

Highly Parallel Applications and Their Architectural Needs

Dr. Pradeep K Dubey, IEEE Fellow and Director of Parallel Computing Lab

Intel Corporation

SAAHPC’11

Page 2:

Notice and Disclaimers

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.

Intel® Itanium®, Xeon™, Pentium®, Intel SpeedStep®, Intel NetBurst®, Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2011, Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.


Page 3:

Optimization Notice – Please read

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example, SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options." Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code, and other factors, you likely will get extra performance on Intel microprocessors.

While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

Intel recommends that you evaluate other compilers to determine which best meet your requirements.

[email protected] July 2011

Page 4:

Highly Parallel Applications

Can compute tame the beast of massive, unstructured, dynamic datasets to enable real-time simulation and analytics?


Page 5:

Measures of Efficiency

• Programmer productivity: the architectural impact of programmability could be even bigger than that of power-performance; maximize the Performance / Effort ratio
• Architectural efficiency: sustaining performance growth under fixed cost and power; measured as Performance / Peak flops (or bandwidth)
• Cluster / datacenter scalability for real-time services

Absolute performance matters; efficiency matters no less!

Architectural efficiency is about exploiting order and lowering overheads.
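To make the Performance / Peak flops metric concrete, here is a toy calculation; the core count, frequency, flops per cycle, and sustained figure below are made-up illustrative assumptions, not any product's specification.

```cpp
#include <cstdio>

int main() {
    // Hypothetical machine parameters, for illustration only.
    double cores = 32, ghz = 1.2, flops_per_cycle = 16;
    double peak_gflops = cores * ghz * flops_per_cycle;   // theoretical peak
    double sustained_gflops = 350.0;                      // pretend measurement
    // Architectural efficiency = sustained performance / peak performance.
    std::printf("peak = %.1f GF/s, efficiency = %.1f%%\n",
                peak_gflops, 100.0 * sustained_gflops / peak_gflops);
}
```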


Page 6:

Heterogeneous Computing

[Diagram: a spectrum of degree of heterogeneity. At one end, platform-driven heterogeneity with performance asymmetry only; at the other, application-driven heterogeneity (our focus), where Application 1 and Application 2 run on distinct Compute Models A and B. The compute models can differ along three axes: memory management (explicit vs. implicit), data parallelism (explicit vs. implicit), and threads (latency- vs. throughput-oriented).]


Page 7:

From Multicore to Manycore

For N identical cores, Amdahl's Law gives:

S = 1 / ((1 − P) + P/N)

With simpler cores, each running a thread at relative speed K_N:

S = 1 / ((1 − P)/K_N + P/(N · K_N)), so that for P → 1, S → N · K_N

S = speedup, P = parallel fraction, N = number of cores, K_N = single-thread performance (single core / multicore)

Many-core makes sense for workloads with a high enough "P", the parallel component; for simplicity, we call these Highly Parallel.

[Chart: speedup S vs. number of cores N for Single-Core, Multi-Core, and Many-Core designs]
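A minimal sketch of the speedup model above; the core counts and per-core speeds are made-up illustrative assumptions, not product figures. It shows the crossover the slide describes: the manycore wins only once P is high enough.

```cpp
#include <cstdio>

// S = 1 / ((1 - P)/K + P/(N * K)): N cores, each at relative speed K.
double speedup(double P, double N, double K) {
    return 1.0 / ((1.0 - P) / K + P / (N * K));
}

int main() {
    const double bigN = 6,  bigK = 1.0;     // assumed multicore: few big cores
    const double smallN = 50, smallK = 0.3; // assumed manycore: many simple cores
    const double Ps[] = {0.5, 0.9, 0.99};
    for (double P : Ps) {
        std::printf("P=%.2f  multicore S=%5.2f  manycore S=%5.2f\n",
                    P, speedup(P, bigN, bigK), speedup(P, smallN, smallK));
    }
}
```

With these assumed numbers, the multicore wins at P = 0.5 while the manycore wins at P = 0.99, matching the "high enough P" condition above.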


Page 8:

Multicore → Manycore

• More cores and threads → simpler cores, more latency tolerance
• More compute density → data-parallel / SIMD
• More memory BW → lower B/F (bytes per flop), lower capacity, less cache per core


Page 9:

Intel® Many Integrated Core (MIC)

Roadmap: "Knights Ferry" (software development platform for MIC) → "Knights Corner" (1st Intel® MIC product, on 22nm, >50 cores) → future "Knights" (future MIC products)

• Extending IA to flexible, fully programmable general-purpose many-core computing as a co-processor
• Common IA programming models, languages, techniques, and software tools with Intel® Xeon® processors
• Industry-leading performance for highly parallel workloads
• Higher efficiency than other attached solutions
• Optimized efficiency for a heterogeneous solution in combination with Intel Xeon processors


Page 10:

Intel® MIC Architecture Programming

[Diagram: a single source program feeds compilers and runtimes common with Intel® Xeon®, producing code for both the Intel® Xeon® processor family and the Intel® MIC architecture co-processor.]

Common with Intel® Xeon®:

• Languages: C, C++, Fortran compilers
• Intel developer tools and libraries
• Coding and optimization techniques
• Ecosystem support

Eliminates the need for a dual programming architecture.


Page 11:

Heterogeneous Programming

[Diagram: the same software stack (ArBB*, MKL*, TBB, Cilk Plus, OpenMP, OpenCL*, and C++/Fortran) is available in both the CPU executable and the MIC native executable; the two sides are connected over PCIe, with MYO/XN runtime support for heterogeneous and parallel compute across them.]

Programming MIC is the same as programming a CPU.
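As a hedged sketch of what this single-source model looked like in practice, the Intel compilers of this era provided an offload pragma for MIC; the function and array names below are hypothetical, for illustration only.

```cpp
// Assumes an Intel Composer XE toolchain with MIC offload support.
void scale(const float* a, float* b, int n) {
    // The loop body is ordinary C++/OpenMP; the offload pragma ships `a`
    // to the card, runs the region there, and copies `b` back over PCIe.
    #pragma offload target(mic) in(a : length(n)) out(b : length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        b[i] = 2.0f * a[i];
}
```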


Page 12:

Learning so far


Page 13:

Scaling with cores

For smaller problem sizes, the Intel® Xeon® processor outperforms Knights Ferry (KNF), while for medium to large problem sizes, KNF outperforms the Intel Xeon processor (by up to 2.5x).

For performance details, refer to: "Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core," ISC'11, by Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, and Pradeep Dubey.


Page 14:

Benefit of big cores

When parallelism is insufficient, big cores significantly outperform small cores.

For performance details, refer to: "Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core," ISC'11, by Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, and Pradeep Dubey.


Page 15:

Scaling further with cores … big and small together

Hybrid LU Performance: dynamic partitioning across Intel Xeon and Intel MIC can pay off (see the sketch after the citation below)!

For performance details, refer to: "Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core," ISC'11, by Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, and Pradeep Dubey.
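The following is a minimal, illustrative sketch of dynamic partitioning, not the cited paper's load balancer: two workers, standing in for the Xeon host and the MIC card, pull block tasks from a shared counter, so whichever device finishes faster naturally takes more of the work. Block count and per-block times are made up.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<int> next_block{0};   // shared work queue: a single counter
const int num_blocks = 64;

// Each worker grabs the next unclaimed block until none remain.
void worker(const char* name, int ms_per_block) {
    int done = 0;
    while (next_block.fetch_add(1) < num_blocks) {
        // Stand-in for factoring one block; a slower device holds it longer.
        std::this_thread::sleep_for(std::chrono::milliseconds(ms_per_block));
        ++done;
    }
    std::printf("%s took %d of %d blocks\n", name, done, num_blocks);
}

int main() {
    std::thread host(worker, "host", 3);  // hypothetical relative speeds
    std::thread card(worker, "card", 1);
    host.join();
    card.join();
}
```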


Page 16:

Platform scaling with cards

Hybrid scaling with cores and cards

System used: dual-socket Intel Xeon processor DP X5680 (Westmere-EP, 12 cores total at 3.33 GHz), 24 GB RAM, running Red Hat Enterprise Linux 6, with four KNF cards attached over PCIe, each with 30 cores at 1.05 GHz and 1 GB GDDR.


Page 17:

Scaling with threads

[Chart: projected speedup vs. thread count (1t through 128t) for a Low-Overhead Barrier vs. the Default Barrier, from a manycore research simulation of a Conjugate Gradient kernel.]

Note: the illustrative scalability projection above is based only on a research manycore simulation.

Threading overheads matter more as thread counts grow and should be kept low.

Measured performance of a Forward Solver kernel: the low overhead of MIC's threading primitives helps it achieve crossover with Xeon at small matrix sizes (e.g., 8K below). A sketch of such a primitive follows.
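As one hypothetical illustration of a low-overhead threading primitive (not necessarily the barrier used in these measurements), a sense-reversing spin barrier avoids the lock and condition variable found in many default barrier implementations:

```cpp
#include <atomic>

// Sense-reversing spin barrier: arriving threads decrement a counter and
// spin on a flag; the last arrival resets the counter and flips the flag.
class SpinBarrier {
    const int num_threads_;
    std::atomic<int> remaining_;
    std::atomic<bool> sense_{false};
public:
    explicit SpinBarrier(int n) : num_threads_(n), remaining_(n) {}

    void wait() {
        bool my_sense = !sense_.load(std::memory_order_relaxed);
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            remaining_.store(num_threads_, std::memory_order_relaxed);
            sense_.store(my_sense, std::memory_order_release); // free waiters
        } else {
            while (sense_.load(std::memory_order_acquire) != my_sense) {
                // spin: cheap when threads arrive close together
            }
        }
    }
};
```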

System used: dual-socket Intel Xeon processor DP X5680 (Westmere-EP, 12 cores total at 3.33 GHz), 24 GB RAM, running Red Hat Enterprise Linux 6, with a KNF card attached over PCIe: 32 cores at 1.2 GHz, 2 GB GDDR.


Page 18:

Scaling further with threads … redesigned data structures

System used: dual-socket Intel Xeon processor DP X5680 (Westmere-EP, 12 cores total at 3.33 GHz), 96 GB RAM, running SuSE Enterprise Linux 11, with a KNF card attached over PCIe: 32 cores, 1.2 GHz, 2 GB GDDR.

Lock-free data structures can speed up tree insertions by 12x-19x; Intel® MIC offers a 2x additional performance boost over Intel® Xeon.
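To illustrate the lock-free idea in its simplest form (a compare-and-swap list push, far simpler than the tree structure measured above), threads retry a CAS instead of serializing on a lock:

```cpp
#include <atomic>

// Treiber-style lock-free push: no lock is ever held, so threads never
// block one another; a failed compare_exchange simply retries with the
// freshly observed head.
struct Node {
    int key;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int key) {
    Node* n = new Node{key, nullptr};
    Node* old_head = head.load(std::memory_order_relaxed);
    do {
        n->next = old_head;  // link to the head we last observed
    } while (!head.compare_exchange_weak(old_head, n,
                 std::memory_order_release, std::memory_order_relaxed));
}
```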


Page 19:

Scaling with SIMD

SIMD: most-efficient form of dense computation
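For instance, a hedged sketch using the Intel compiler's vectorization pragma of that era (a plain loop the compiler auto-vectorizes works the same way): each group of iterations maps onto one wide vector operation.

```cpp
#include <cstddef>

// SAXPY: once vectorized, one SIMD instruction processes many elements
// per cycle instead of one (e.g., 16 floats with 512-bit SIMD).
void saxpy(float a, const float* x, float* y, std::size_t n) {
    #pragma simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];  // each SIMD lane computes one element
}
```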

System configuration used for the performance data above: single-core performance obtained on a 3.3 GHz Intel Core i7 X980 (Westmere) with 12 GB RAM, and on a Core i7 2600S (8 MB cache, 2.8 GHz, 8 GB memory); SuSE Enterprise Linux version 11; Intel C++ Composer XE for Linux compiler (version 2011.1.108).


Page 20:

Scaling further with SIMD …

The 512-bit SIMD extension in Intel® MIC continues to scale SIMD performance while maintaining high efficiency, with hardware support for gather/scatter.
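The access pattern that hardware gather accelerates looks like the loop below (an illustrative sketch): the loads are index-driven and non-contiguous, so without gather support a SIMD implementation must assemble each vector element by element.

```cpp
// Indexed (gather) access: table[idx[i]] touches scattered addresses.
// A hardware gather lets a vector of indices fetch a vector of values
// in one instruction instead of one scalar load per lane.
void gather_add(float* out, const float* table, const int* idx, int n) {
    for (int i = 0; i < n; ++i)
        out[i] += table[idx[i]];
}
```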

System configuration used for the performance data above: single-core performance obtained on a 3.3 GHz Intel Core i7 X980 (Westmere) with 12 GB RAM, and on a Core i7 2600S (8 MB cache, 2.8 GHz, 8 GB memory); SuSE Enterprise Linux version 11; Intel C++ Composer XE for Linux compiler (version 2011.1.108). MIC data from single-core performance of a KNF card with 32 cores, 1.2 GHz, and 2 GB of GDDR.


Page 21:

Scaling with caches

Manycore cache simulation, 256 KB/core vs. 32 KB/core: speedup of 2-3x or more, with no programming effort.

Caches: unparalleled ROI in terms of power/performance and programming effort.

Note: illustrative simulation-based performance projection only.


Page 22:

Scaling further with caches …

[Charts: 7-point Stencil, LBM, QCD-WDS1 (does not fit in LLC), and QCD-WDS1 (fits in LLC).]

For the top two charts, refer to: "3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs," accepted at SC'10, by Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, and Pradeep Dubey. System config for the bottom two charts: Intel Xeon Processor X5680, 6 cores at 3.3 GHz with 6 GB of DDR3.

Caches can boost the effectiveness of locality-aware transforms.
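A simplified cousin of the blocking in the cited paper (which also tiles the time dimension; this sketch tiles space only, with made-up grid and tile sizes): loop tiling keeps a small working set resident in cache while streaming through the z dimension.

```cpp
#include <algorithm>

const int NX = 512, NY = 512, NZ = 512;   // illustrative grid size
const int BX = 64, BY = 8;                // illustrative tile size
inline int at(int x, int y, int z) { return (z * NY + y) * NX + x; }

// 7-point stencil with spatial (x, y) blocking: each tile's planes stay
// cache-resident as the z loop sweeps through them.
void stencil(const float* in, float* out, float c0, float c1) {
    for (int yy = 1; yy < NY - 1; yy += BY)
      for (int xx = 1; xx < NX - 1; xx += BX)
        for (int z = 1; z < NZ - 1; ++z)          // stream through z
          for (int y = yy; y < std::min(yy + BY, NY - 1); ++y)
            for (int x = xx; x < std::min(xx + BX, NX - 1); ++x)
              out[at(x,y,z)] = c0 * in[at(x,y,z)]
                  + c1 * (in[at(x-1,y,z)] + in[at(x+1,y,z)]
                        + in[at(x,y-1,z)] + in[at(x,y+1,z)]
                        + in[at(x,y,z-1)] + in[at(x,y,z+1)]);
}
```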


Page 23:

Scaling even further with caches …

Performance measured on: dual-socket Intel Xeon X5570 CPU with 8 cores running at 2.93 GHz, 96 GB RAM, Fedora Linux Release 14.

Using a cache-friendly data structure reduced the bandwidth requirement and improved performance by up to two orders of magnitude (a 256x speedup)!
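One common flavor of cache-friendly restructuring (illustrative only, not necessarily the transformation used here) is switching from array-of-structures to structure-of-arrays, so every cache line fetched is fully used by the field being streamed:

```cpp
// Array-of-structures: fields of one element sit together, so streaming a
// single field wastes most of each 64-byte cache line.
struct ParticlesAoS { struct P { float x, y, z, mass; } p[1024]; };

// Structure-of-arrays: each field is dense and contiguous.
struct ParticlesSoA { float x[1024], y[1024], z[1024], mass[1024]; };

float total_mass(const ParticlesSoA& ps) {
    float sum = 0.0f;
    for (int i = 0; i < 1024; ++i)
        sum += ps.mass[i];  // contiguous: 16 masses per 64-byte line,
    return sum;             // vs. 4 per line in the AoS layout
}
```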


Page 24:

Putting It All Together: Stencil Operations

1.5x-2.2x speedup over dual-socket Intel® Xeon® processors for Intel® MIC / KNF across various stencil-dominated workloads.

System used: dual-socket Intel Xeon processor DP X5680 (Westmere-EP, 12 cores total at 3.33 GHz), 96 GB RAM, running SuSE Enterprise Linux 11, with a KNF card attached over PCIe: 32 cores, 1.2 GHz, 2 GB GDDR. The RTM algorithm is implemented from "Selecting the right hardware for RTM," by Clapp et al., The Leading Edge 29, 48 (2010).


Page 25:

Putting It All Together: Scaling Compressed Sensing Application

~15x speedup brings compressed sensing almost to the realm of interactivity for medical practitioners!

For performance details, refer to: "High-Performance 3D Compressive Sensing MRI Reconstruction," International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS '10), by Daehyun Kim, Joshua D. Trzasko, Mikhail Smelyanskiy, Clifton R. Haider, Armando Manduca, and Pradeep Dubey.

Page 26:

Programmer Efficiency

System configuration used for the performance data above: 3.3 GHz 6-core Intel Core i7 X980 (Westmere), 12 GB RAM, SuSE Enterprise Linux version 11, Intel C++ Composer XE for Linux compiler (version 2011.1.108).

The Intel tool chain can deliver within 2x of Ninja performance, with an optimized performance-to-programming-effort ROI, for both Intel Xeon and Intel MIC.

Ninja gap: the performance gap between naïve, parallelism-unaware code and the best-performing hand-tuned code.


Page 27:

Summary

• Highly parallel workloads offer an opportunity to trade off scalar performance, the distinguishing architectural basis of manycore
• The programming challenge of efficient heterogeneous offload can be managed by limiting the architectural differences
• Architectural affinity with Xeon helps MIC deliver performance with a common programming model and a shared set of algorithmic optimizations
• Large coherent caches, wide SIMD, and low-overhead threading help MIC deliver excellent performance-vs-programming-effort ROI for highly parallel applications



Page 29:

Who We Are: Parallel Computing Lab

Parallel Computing: Research to Realization. Worldwide leadership in throughput/parallel computing; an industry role model for application-driven architecture research, ensuring Intel leadership in this application segment.

Dual charter: application-driven architecture research and multicore/manycore product-intercept opportunities.

• Architectural focus: the "feeding the beast" (memory) challenge, domain-specific support, massively threaded machines, unstructured accesses, distributed decomposition
• Workload focus: multimodal real-time physical simulation, behavioral simulation, interventional medical imaging, large-scale optimization (FSI), massive data computing, non-numeric computing
• Industry and academic co-travelers: Mayo, HPI, CERN, Stanford, UNC, Georgia Tech, and others

Recent accomplishments:

• First TFlop SGEMM and highest-performing SparseMVM on KNF silicon, demoed at SC'09
• Fastest LU/Linpack demo on KNF at ISC'10 and the HeteroLU paper at ISC'11
• Fastest search, sort, and relational join; Best Paper Award for Tree Search at SIGMOD 2010


Page 30:

Putting It All Together: Compressed Sensing Application

[Images: Current Clinical (SENSE+PF) vs. Compressive Sensing reconstruction.]

Single time frame from a CAPR CE-MRA exam (8-channel, R=19x, 256x160x80) [Trzasko2010]
