TRANSCRIPT
Highly Parallel Applications and Their Architectural Needs
Dr. Pradeep K. Dubey, IEEE Fellow and Director of the Parallel Computing Lab
Intel Corporation
SAAHPC’11
Notice and Disclaimers
Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.
All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.
Intel® Itanium®, Xeon™, Pentium®, Intel SpeedStep®, Intel NetBurst®, Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2011, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
[email protected] July 2011
Optimization Notice – Please read
Optimization Notice
Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options." Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code, and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Intel recommends that you evaluate other compilers to determine which best meet your requirements.
Highly Parallel Applications
Can compute tame the beast of massive, unstructured, dynamic datasets to enable real-time simulation and analytics?
Measures of Efficiency
• Programmer productivity: the architectural impact of programmability could be even bigger than that of power-performance. Maximize the performance/effort ratio.
• Architectural efficiency: sustaining performance growth under fixed cost and power. Performance / peak flops (or bandwidth).
• Cluster/datacenter scalability for real-time services.
Absolute performance matters; efficiency matters no less!
Architectural efficiency is about exploiting order and lowering overheads.
Heterogeneous Computing
[Figure: degree of heterogeneity, ranging from performance asymmetry only to fully distinct compute models A and B, spanning explicit-implicit memory management, explicit-implicit data parallelism, and latency- vs. throughput-oriented threads. Platform-driven heterogeneity is distinguished from application-driven heterogeneity (our focus), with Applications 1 and 2 mapped onto this spectrum.]
From Multicore to Manycore
Single-core / multicore, N cores (Amdahl's law):
S = 1 / ((1 - P) + P/N)
Many-core, where each simpler core has single-thread performance K_N relative to a big core:
S = 1 / ((1 - P)/K_N + P/(N * K_N))
For P → 1, S → N * K_N.
Many-core makes sense for workloads with a high enough "P" (parallel component); for simplicity, we call these Highly Parallel.
S = speedup, P = parallel fraction, N = number of cores, K_N = single-thread performance of a many-core core relative to a single-core/multicore core.
Multicore → Manycore
• More cores and threads → simpler cores, more latency tolerance
• More compute density → data-parallel / SIMD
• More memory bandwidth → lower bytes/flop, lower capacity, less cache per core
Intel® Many Integrated Core (MIC)
"Knights Ferry": software development platform for MIC. "Knights Corner": 1st Intel® MIC product, on 22 nm, with >50 cores. Future "Knights": future MIC products.
• Extending IA to flexible, fully programmable, general-purpose many-core computing as a co-processor
• Common IA programming models, languages, techniques, and software tools with Intel® Xeon® processors
• Industry-leading performance for highly parallel workloads
• Higher efficiency than other attached solutions
• Optimized efficiency for a heterogeneous solution in combination with Intel Xeon processors
Intel® MIC Architecture Programming
[Figure: a single source, with common compilers and runtimes, targets both the Intel® Xeon® processor family and the Intel® MIC architecture co-processor.]
Common with Intel® Xeon®:
• Languages: C, C++, Fortran compilers
• Intel developer tools and libraries
• Coding and optimization techniques
• Ecosystem support
Eliminates the need for a dual programming architecture.
Heterogeneous Programming
[Figure: the same software stack (MKL, ArBB, TBB, Cilk Plus, OpenMP, OpenCL, and C++/FTN compilers and runtimes) builds both a CPU executable and a MIC-native executable; the two sides run parallel compute jointly as heterogeneous compute, connected over PCIe and sharing data through the MYO/XN runtime.]
Programming MIC is the same as programming a CPU.
Learning so far
Scaling with cores
For smaller problem sizes, the Intel® Xeon® processor outperforms Knights Ferry (KNF), while for medium-to-large problem sizes, KNF outperforms the Intel Xeon processor (up to 2.5x).
For performance details, refer to: "Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core," ISC'11, by Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, and Pradeep Dubey.
Benefit of big cores
When parallelism is insufficient, big cores significantly outperform small cores.
For performance details, refer to: "Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core," ISC'11, by Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, and Pradeep Dubey.
Scaling further with cores … big and small together
Hybrid LU performance: dynamic partitioning across Intel Xeon and Intel MIC can pay off!
For performance details, refer to: "Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core," ISC'11, by Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, and Pradeep Dubey.
Platform scaling with cards
Hybrid scaling with cores and cards.
System used: dual-socket Intel Xeon processor DP X5680 (Westmere-EP), 12 cores total at 3.33 GHz, 24 GB RAM, Red Hat Enterprise Linux 6; four KNF cards attached over PCIe, each with 30 cores at 1.05 GHz and 1 GB GDDR.
Scaling with threads
[Chart: many-core research simulation of a Conjugate Gradient kernel, scaled from 1 to 128 threads; the low-overhead barrier keeps scaling where the default barrier flattens out. Note: illustrative scalability projection based only on a research many-core simulation.]
Threading overheads matter with more threads and should be kept low.
[Chart: measured performance of a Forward Solver kernel; the low overhead of threading primitives helps MIC achieve crossover with respect to Xeon at small matrix sizes (e.g., 8K).]
System used: dual-socket Intel Xeon processor DP X5680 (Westmere-EP), 12 cores total at 3.33 GHz, 24 GB RAM, Red Hat Enterprise Linux 6; KNF card attached over PCIe, with 32 cores at 1.2 GHz and 2 GB GDDR.
Scaling further with threads … redesigned data structures
Lock-free data structures can speed up tree insertions 12x-19x; Intel® MIC offers a 2x additional performance boost over Intel® Xeon.
System used: dual-socket Intel Xeon processor DP X5680 (Westmere-EP), 12 cores total at 3.33 GHz, 96 GB RAM, SuSE Enterprise Linux 11; KNF card attached over PCIe, with 32 cores at 1.2 GHz and 2 GB GDDR.
Scaling with SIMD
SIMD: the most efficient form of dense computation.
System configuration used for the performance data above: single-core performance obtained on a 3.3 GHz Intel Core i7 X980 (Westmere) with 12 GB RAM, and a Core i7 2600S (8 MB cache, 2.8 GHz, 8 GB memory); SuSE Enterprise Linux 11; Intel C++ Composer XE for Linux compiler (version 2011.1.108).
Scaling further with SIMD …
The 512-bit SIMD extension in Intel® MIC continues to scale SIMD performance while maintaining high efficiency, with hardware support for gather/scatter.
System configuration used for the performance data above: single-core performance obtained on a 3.3 GHz Intel Core i7 X980 (Westmere) with 12 GB RAM, and a Core i7 2600S (8 MB cache, 2.8 GHz, 8 GB memory); SuSE Enterprise Linux 11; Intel C++ Composer XE for Linux compiler (version 2011.1.108). MIC data is from single-core performance of a KNF card with 32 cores at 1.2 GHz and 2 GB GDDR.
Scaling with caches
Many-core cache simulation, 256 KB/core vs. 32 KB/core: speedup of 2-3x or more, with no programming effort.
Caches: unparalleled ROI in terms of power/performance and programming effort.
Note: illustrative simulation-based performance projection only.
Scaling further with caches …
[Charts: 7-point stencil, LBM, and QCD-WDS1 kernels, with the QCD-WDS1 working set shown both fitting and not fitting in the LLC.]
Caches can boost the effectiveness of locality-aware transforms.
For the top two charts, refer to: "3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs," accepted at SC'10, by Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, and Pradeep Dubey. System config for the bottom two charts: Intel Xeon processor X5680, 6 cores at 3.3 GHz with 6 GB of DDR3.
Scaling even further with caches …
Using a cache-friendly data structure reduced the bandwidth requirement and improved performance by up to two orders of magnitude (256x speedup)!
Performance measured on: dual-socket Intel Xeon X5570 CPU, with 8 cores running at 2.93 GHz, 96 GB RAM, Fedora Linux release 14.
Putting It All Together: Stencil Operations
1.5x-2.2x speedup for Intel® MIC / KNF over dual-socket Intel® Xeon® processors across various stencil-dominated workloads.
System used: dual-socket Intel Xeon processor DP X5680 (Westmere-EP), 12 cores total at 3.33 GHz, 96 GB RAM, SuSE Enterprise Linux 11; KNF card attached over PCIe, with 32 cores at 1.2 GHz and 2 GB GDDR. The RTM algorithm is implemented from "Selecting the right hardware for RTM," by Clapp et al., The Leading Edge 29, 48 (2010).
Putting It All Together: Scaling a Compressed Sensing Application
A ~15x speedup brings compressed sensing almost to the realm of interactivity for medical practitioners!
For performance details, refer to: "High-Performance 3D Compressive Sensing MRI Reconstruction," International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS '10), by Daehyun Kim, Joshua D. Trzasko, Mikhail Smelyanskiy, Clifton R. Haider, Armando Manduca, and Pradeep Dubey.
Programmer Efficiency
Ninja gap: the performance gap between naive, parallelism-unaware code and the best-performing hand-tuned code.
The Intel tool chain can deliver within 2x of ninja performance, with an optimized performance-to-programming-effort ROI, for both Intel Xeon and Intel MIC.
System configuration used for the performance data above: 3.3 GHz 6-core Intel Core i7 X980 (Westmere), 12 GB RAM, SuSE Enterprise Linux 11, Intel C++ Composer XE for Linux compiler (version 2011.1.108).
Summary
Highly parallel workloads offer an opportunity to trade off scalar performance: the distinguishing architectural basis of manycore.
The programming challenge of efficient heterogeneous offload can be managed by limiting the architectural differences.
Architectural affinity with Xeon helps MIC deliver performance with a common programming model and a shared set of algorithmic optimizations.
Large coherent caches, wide SIMD, and low-overhead threading help MIC deliver excellent performance-vs-programming-effort ROI for highly parallel applications.
Who We Are: Parallel Computing Lab -- Parallel Computing: Research to Realization
Worldwide leadership in throughput/parallel computing; an industry role model for application-driven architecture research, ensuring Intel leadership for this application segment.
Dual charter: application-driven architecture research and multicore/manycore product-intercept opportunities.
Architectural focus: the "feeding the beast" (memory) challenge, domain-specific support, massively threaded machines, unstructured accesses, distributed decomposition.
Workload focus: multimodal real-time physical simulation, behavioral simulation, interventional medical imaging, large-scale optimization (FSI), massive data computing, non-numeric computing.
Industry and academic co-travelers: Mayo, HPI, CERN, Stanford, UNC, Georgia Tech, and others.
Recent accomplishments: first TFlop SGEMM and highest-performing SparseMVM on KNF silicon, demoed at SC'09; fastest LU/Linpack demo on KNF at ISC'10 and HeteroLU paper at ISC'11; fastest search, sort, and relational join, with a Best Paper Award for tree search at SIGMOD 2010.
Putting It All Together: Compressed Sensing Application
[Images: current clinical reconstruction (SENSE+PF) vs. compressive sensing. Single time frame from a CAPR CE-MRA exam (8-channel, R=19x, 256x160x80) [Trzasko2010].]