TRANSCRIPT
1
Algorithmic Optimizations for Many-core Wide-vector Processors
Jongsoo Park
Parallel Computing Lab, Intel
2
Notice and Disclaimers
Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.
All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.
Intel® Itanium®, Intel® Xeon®, Xeon Phi™, Pentium®, Intel SpeedStep®, Intel NetBurst®, Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2012, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
3
Notice and Disclaimers Continued
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
4
Outline
• Intel's take on many-core processors
– Architecture of Knights Corner coprocessors
– Programming model
• Low-communication algorithm: "segment-of-interest" fast Fourier transform
• Approximate computing
5
Intel® Xeon Phi™ Coprocessor
6
Xeon Phi™ Architecture
60 cores
Ring Interconnect
512 KB × 60 = 30 MB L2$
16-wide SIMD (SP)
8-wide SIMD (DP)
1.1 GHz × 60 cores × 16 lanes × 2 (FMA) = 2.1 TFLOPS (SP)
1.1 GHz × 60 cores × 8 lanes × 2 (FMA) = 1.05 TFLOPS (DP)
Adding 2 Xeon Phis gives ~7× the peak of a 2-socket Sandy Bridge
512KB L2 per Core
7
Xeon Phi™ Programming Models
OpenMP, pthread
MPI: a set of cores can directly work as an MPI process
OpenCL: recently announced
Cilk+: task-level parallelism
ISPC: threads + SIMD expressed the same way (as in CUDA)
Automatic parallelization by compiler
Other frameworks for Xeon are easy to port
…
8
Key Performance Considerations
Parallelism
– 60 cores
– SIMD: 16-wide (SP), 8-wide (DP), gather/scatter, swizzling
Memory bandwidth
– On-chip caches, non-temporal stores
– Spatial locality
Memory latency hiding
– 4 simultaneous hardware threads per core
– Prefetching
PCIe latency hiding
– Asynchronous offload directives
– Asynchronous MPI calls
9
Software Portability
Correctness portability:
– Any code that works on Xeon® works right away
Performance portability:
– Optimizations typically speed up both Xeon® and Xeon Phi™
10
Performance Portability Example – "Segment-of-Interest" FFT
The same code is used for both Xeon® and Xeon Phi™
– Same optimizations: loop interchange/tiling, unroll-and-jam, …
– Just different tiling and unroll factors
11
My Xeon Phi™ Optimization Flow – Design top-down, measure bottom-up
Design data layout and decomposition
– Maximize locality and minimize communication/synchronization
– Consider vectorization (e.g., SoA vs. AoS)
– Single-thread optimizations depend on this
Measure single-core performance
– Check vectorization and prefetching: -vec-report compiler flag
– Pragmas for vectorization and prefetching when appropriate
– Convince yourself why you are achieving the measured compute efficiency: IPC and L1/L2$ misses from VTune are useful metrics
Measure thread-level scaling
– Use more cores while appropriately scaling input
– VTune metrics to look at: load balance, remote L2$ access
12
Algorithmic Optimizations
Before doing all of these, ask yourself
“Is this the right algorithm for modern processors?”
Low-communication Algorithms
14
Communication-bound Application Example: 1D FFT
Surprisingly low compute efficiency of 1D FFT in the HPCC list
~2% efficiency on the K computer
15
Many-core wide-vector processor + communication-bound application
Estimated 1-D FFT time for 2³² DP complex numbers
with 32 nodes of (2-socket Xeon E5-2680 + 1 Xeon Phi SE10)
Park et al., submitted to SC'13
16
Cooley-Tukey Factorization
Circa 1965 (also 1866 by Gauss)
N = MP → M length-P FFTs + P length-M FFTs
3 all-to-all communications
17
Segment-of-Interest (SOI) Factorization
3 all-to-alls → 1 all-to-all
Tang, Park, Kim, and Petrov, SC’12
18
Trading off computation for communication
Park et al., submitted to SC’13
Approximate Computing
20
Transcendental Functions in Synthetic Aperture Radar (SAR)
Transcendental functions are hard to vectorize¹
¹ 3K×3K image reconstructed from 2.8K pulses with 4K samples each. Xeon: Intel® Xeon® E5-2670, 2-socket, 2.6GHz
Park et al., SC’12
21
Approximate Strength Reduction in SAR
2-4× speedups by optimizing sqrt, sin, and cos
Intel® Xeon Phi™ results: evaluation card only and not necessarily reflective of production card specifications.
Xeon: Intel® Xeon® E5-2670, 2-socket, 2.6GHz
22
Strength Reduction
for i from 0 to N:
    A[i] = c × i
→ 1 multiplication per iteration

t = 0
for i from 0 to N:
    A[i] = t
    t += c
→ 1 addition per iteration
23
Approximate Strength Reduction (ASR)
for each pixel (i, j) at x:
    R = sqrt((x_i - p_i)² + (x_j - p_j)²)
    bin = (R - R0) × idr
    arg = (cos(kR), sin(kR))
    …
→ 1 sqrt, 1 cos, 1 sin per pixel

pre-compute A, B, C, Φ, Ψ, Γ
for each pixel (i, j) at x:
    bin = A[j] + B[i] + j·C[i]
    arg = Φ[j] · Ψ[i] · γ[i]
    γ[i] *= Γ[i]
→ 13 multiplications, 10 additions per pixel
25
Approximate Strength Reduction (ASR)
for each pixel (i, j) at x:
    R = sqrt((x_i - p_i)² + (x_j - p_j)²)
    bin = (R - R0) × idr
    arg = (cos(kR), sin(kR))
    …
→ 1 sqrt (DP), 1 cos (w/ DP arg. reduction), 1 sin (w/ DP arg. reduction)

pre-compute A, B, C, Φ, Ψ, Γ
for each pixel (i, j) at x:
    bin = A[j] + B[i] + j·C[i]
    arg = Φ[j] · Ψ[i] · γ[i]
    γ[i] *= Γ[i]
→ 13 multiplications (SP), 10 additions (SP)
All vectorizable
26
ASR Mathematics – Square Root
• Approximate sqrt by the 2nd-order Taylor series
• Apply the conventional strength reduction:
– Pre-compute constants
– Incrementally compute the running terms
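The concrete constants are not shown on the slide, so the following is a generic reconstruction of the standard second-order expansion rather than the paper's exact notation. Expanding the distance around a per-block reference point R_c:

```latex
% Second-order Taylor expansion of the SAR distance term around a
% per-block reference distance R_c (hypothetical notation; the
% paper's actual constants are not reproduced here).
R(i,j) = \sqrt{(x_i - p_i)^2 + (x_j - p_j)^2}
       = \sqrt{R_c^2 + \epsilon}
\approx R_c + \frac{\epsilon}{2 R_c} - \frac{\epsilon^2}{8 R_c^3}
```

where ε collects the low-order terms in i and j. Each resulting term is a polynomial in i and j, so the conventional strength reduction of the previous slide applies directly.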
27
ASR in SAR
2-4× speedups by optimizing sqrt, sin, and cos
Intel® Xeon Phi™ results: evaluation card only and not necessarily reflective of production card specifications.
Xeon: Intel® Xeon® E5-2670, 2-socket, 2.6GHz
28
ASR Mathematics – Accuracy
• Approximate sqrt by the 2nd-order Taylor series
• Error increases as i or j gets larger, so apply ASR per block (bound i and j)
• Mixed precision:
– DP: pre-compute constants
– SP: main compute
29
Accuracy and Performance Trade-offs
20 higher dB = 0.5× error
64×64 blocks: ~2× speedup w/ similar accuracy
Measured on Intel® Xeon® E5-2670, 2-socket, 2.6GHz
30
ASR in Ultrasound Beamforming
16×64 blocks: ~2× speedup with similar accuracy
Collaboration with GE Healthcare
Measured on Intel® Xeon® E5-2670, 2-socket, 2.6GHz
31
Accuracy-Performance Trade-offs in SOI FFT
Tang, Park, Kim, and Petrov, SC’12
32
Conclusion
Xeon Phi
– Many-core wide-vector coprocessor with software portability
Algorithmic Optimizations
– Low-communication algorithms: SOI FFT
– Approximation: approximate strength reduction
Future Work
– Generalization and tool support
33
Acknowledgement
Peter Tang
Intel Parallel Computing Lab: Daehyun Kim, Mikhail Smelyanskiy, Ganesh Bikshandi, Karthikeyan Vaidyanathan, and Pradeep Dubey
Intel MKL Team: Vladimir Petrov and Robert Hanek
Georgia Tech Research Institute: Thomas Benson, Daniel Campbell
Reservoir Lab: Nicolas Vasilache and Richard Lethin
DARPA UHPC project*
* This research was, in part, funded by the U.S. Government under contract number HR0011-10-3-0007. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
34
Q&A