
Page 1

Algorithmic Optimizations for Many-core Wide-vector Processors

Jongsoo Park

Parallel Computing Lab, Intel

Page 2

Notice and Disclaimers

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.

Intel® Itanium®, Intel® Xeon®, Xeon Phi™, Pentium®, Intel SpeedStep®, Intel NetBurst®, Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2012, Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.


Page 3

Notice and Disclaimers Continued

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Page 4

Outline

• Intel’s take on many-core processors
– Architecture of Knights Corner coprocessors
– Programming model

• Low-communication algorithm: “segment-of-interest” fast Fourier transform

• Approximate computing

Page 5

Intel® Xeon Phi™ Coprocessor

Page 6

Xeon Phi™ Architecture

– 60 cores
– Ring interconnect
– 512KB L2 per core; 512KB x 60 = 30MB of L2$
– 16-wide SIMD (SP), 8-wide SIMD (DP)
– 1.1GHz x 60 x 16 x 2 = 2.1 TFLOPS (SP)
– 1.1GHz x 60 x 8 x 2 = 1.05 TFLOPS (DP)
– Adding 2 Xeon Phis: ~7x the peak FLOPS of a 2-socket Sandy Bridge
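To see where the x16 and x2 factors in the peak-FLOPS products come from: each 512-bit fused multiply-add processes 16 SP lanes with a multiply and an add each, i.e., 32 SP FLOPs per instruction. A minimal sketch, assuming the 512-bit intrinsics exposed through <immintrin.h> (not code from the talk):

    #include <immintrin.h>

    /* One 512-bit FMA: elementwise a*b + c over 16 SP lanes.
       16 lanes x 2 FLOPs = 32 SP FLOPs per instruction per core. */
    __m512 fma16(__m512 a, __m512 b, __m512 c) {
        return _mm512_fmadd_ps(a, b, c);
    }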

Page 7

Xeon Phi™ Programming Models

OpenMP, pthreads

MPI: a set of cores can directly work as an MPI process

OpenCL: recently announced

Cilk+: task-level parallelism

ISPC: thread and SIMD parallelism expressed the same way (as in CUDA)

Automatic parallelization by compiler

Other frameworks written for Xeon are easy to port
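As a concrete flavor of the offload-plus-OpenMP combination, here is a minimal sketch assuming the Intel compiler's offload pragmas for Knights Corner (the scaling kernel and sizes are mine, not the talk's):

    #include <stdio.h>
    #define N 1024
    static float a[N], b[N];

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = (float)i;

        /* Run the loop on coprocessor 0; in/out clauses control PCIe transfers. */
        #pragma offload target(mic:0) in(a) out(b)
        {
            #pragma omp parallel for   /* OpenMP threads across the cores */
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i];
        }
        printf("b[N-1] = %f\n", b[N - 1]);
        return 0;
    }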

Page 8

Key Performance Considerations

Parallelism
– 60 cores
– SIMD: 16-wide (SP), 8-wide (DP), gather/scatter, swizzling

Memory bandwidth
– On-chip caches, non-temporal stores
– Spatial locality

Memory latency hiding
– 4 simultaneous hardware threads per core
– Prefetching

PCIe latency hiding (see the sketch after this list)
– Asynchronous offload directives
– Asynchronous MPI calls
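A minimal sketch of the asynchronous-offload idiom, assuming the Intel compiler's signal/wait clauses (array sizes and the kernel are illustrative, not from the talk):

    #include <stdio.h>
    #define N (1 << 20)
    static float a[N], b[N];

    int main(void) {
        char sig;   /* any variable's address serves as a signal tag */
        for (int i = 0; i < N; i++) a[i] = 1.0f;

        /* Launch the offload and return immediately; &sig tags it. */
        #pragma offload target(mic:0) signal(&sig) in(a) out(b)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++) b[i] = 2.0f * a[i];
        }

        /* ... host-side work here overlaps with the coprocessor ... */

        /* Block until the offload tagged with &sig completes. */
        #pragma offload_wait target(mic:0) wait(&sig)
        printf("b[0] = %f\n", b[0]);
        return 0;
    }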

Page 9

Software Portability

Correctness portability:
– Any code that works on Xeon® works right away

Performance portability:
– Optimizations typically speed up both Xeon® and Xeon Phi™

Page 10

Performance Portability Example – “Segment-of-Interest” FFT

The same code is used for both Xeon® and Xeon Phi™
– Same optimizations: loop interchange/tiling, unroll-and-jam, … (see the sketch below)
– Just different tiling and unroll factors
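To make the unroll-and-jam transformation concrete, here is a minimal sketch on a matrix-vector kernel (my example, not the talk's FFT code): the outer i-loop is unrolled by 2 and the copies are jammed into one inner loop, so each x[j] load is reused across two rows.

    /* Assumes n is even; the unroll factor (2 here) is the tunable knob
       that differs between Xeon and Xeon Phi. */
    void matvec_uaj(int n, const float *A, const float *x, float *y) {
        for (int i = 0; i < n; i += 2) {       /* outer loop unrolled by 2 */
            float s0 = 0.0f, s1 = 0.0f;
            for (int j = 0; j < n; j++) {      /* jammed inner loop */
                s0 += A[i * n + j]       * x[j];
                s1 += A[(i + 1) * n + j] * x[j];
            }
            y[i]     = s0;
            y[i + 1] = s1;
        }
    }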

Page 11

My Xeon Phi™ Optimization Flow – Design Top-down, Measure Bottom-up

Design data layout and decomposition
– Maximize locality and minimize communication/synchronization
– Consider vectorization (e.g., SOA vs. AOS)
– Single-thread optimizations depend on this

Measure single-core performance
– Check vectorization and prefetching: -vec-report compiler flag
– Pragmas for vectorization and prefetching when appropriate (see the sketch after this list)
– Convince yourself why you are achieving a measured compute efficiency: IPC and L1/L2$ misses from VTune are useful metrics

Measure thread-level scaling
– Use more cores while appropriately scaling the input
– VTune metrics to look at: load balance, remote L2$ accesses
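A minimal sketch of the single-core checklist above, assuming KNC-era icc pragmas and flags (e.g., compile with icc -mmic -O3 -vec-report3 and read the vectorization report); the kernel is mine, not the talk's:

    /* restrict tells the compiler x and y don't alias. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y) {
        #pragma prefetch x:1:64   /* hint: prefetch x into L2, 64 iters ahead */
        #pragma simd              /* assert the loop is safe to vectorize */
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }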

Page 12

Algorithmic Optimizations

Before doing all of these, ask yourself

“Is this the right algorithm for modern processors?”

Page 13

Low-communication Algorithms

Page 14

Communication-bound Application Example: 1D FFT

Surprisingly low compute efficiency of 1D FFT in the HPCC list

~2% efficiency on the K computer

Page 15

Many-core wide-vector processor + communication-bound application

Estimated 1-D FFT time for 2^32 DP complex numbers with 32 nodes of (2-socket Xeon E5-2680 + 1 Xeon Phi SE10)

Park et al., submitted to SC’13

Page 16

Cooley-Tukey Factorization

Circa 1965 (also derived by Gauss, published 1866)

N = M·P: M length-P FFTs + P length-M FFTs

3 all-to-all communications
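In symbols (a standard derivation, not copied from the slides): for N = MP, index the input as n = P·n1 + n2 and the output as k = k1 + M·k2; then

\[
X_{k_1 + M k_2} \;=\; \sum_{n_2=0}^{P-1} \omega_P^{\,n_2 k_2}
\left[ \omega_N^{\,n_2 k_1} \sum_{n_1=0}^{M-1} x_{P n_1 + n_2}\, \omega_M^{\,n_1 k_1} \right],
\qquad \omega_L = e^{-2\pi i / L}.
\]

The inner sums are P DFTs of length M, the bracketed factors are the twiddle multiplications, and the outer sums are M DFTs of length P. On a distributed machine each stage wants the data partitioned along a different index, which is what forces the all-to-all exchanges.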

Page 17

Segment-of-Interest (SOI) Factorization

3 all-to-alls → 1 all-to-all

Tang, Park, Kim, and Petrov, SC’12

Page 18

Trading off computation for communication

Park et al., submitted to SC’13

Page 19

Approximate Computing

Page 20

Transcendental Functions in Synthetic Aperture Radar (SAR)

Transcendental functions are hard to vectorize.

3K x 3K image reconstructed from 2.8K pulses with 4K samples each. Xeon: Intel® Xeon® E5-2670, 2-socket, 2.6GHz

Park et al., SC’12

Page 21

Approximate Strength Reduction in SAR

2-4x speedups by optimizing sqrt, sin, and cos

Intel® Xeon Phi™ results: evaluation card only and not necessarily reflective of production card specifications.

Xeon: Intel® Xeon® E5-2670, 2-socket, 2.6GHz

Page 22

Strength Reduction

/* Before: one multiplication per iteration */
for (int i = 0; i < N; i++)
    A[i] = c * i;

/* After strength reduction: one addition per iteration.
   Note: repeated addition accumulates rounding error, which is
   why the generalization below is "approximate". */
float t = 0.0f;
for (int i = 0; i < N; i++) {
    A[i] = t;
    t += c;
}

Page 23

Approximate Strength Reduction (ASR)

Before:

for each pixel (i, j) at x
    R   = sqrt((x_i - p_i)^2 + (x_j - p_j)^2)
    bin = (R - R0) * idr
    arg = (cos(kR), sin(kR))

1 sqrt (DP), 1 cos and 1 sin (both w/ DP argument reduction)

After (pre-compute tables A, B, C, Φ, Ψ, Γ):

for each pixel (i, j) at x
    bin   = A[j] + B[i] + j*C[i]
    arg   = Φ[j] * Ψ[i] * γ[i]
    γ[i] *= Γ[i]

13 multiplications and 10 additions (SP), all vectorizable


Page 26

ASR Mathematics – Square Root

• Approximate sqrt by the 2nd-order Taylor series

• Apply the conventional strength reduction:
– Pre-compute the constant coefficients
– Incrementally compute the index-dependent terms (e.g., via recurrences like γ[i] *= Γ[i])
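The specific symbols on this slide were lost in extraction; as a sketch of the idea (my notation), expand the range around a reference value R_c:

\[
\sqrt{R_c^2 + \delta} \;\approx\; R_c + \frac{\delta}{2 R_c} - \frac{\delta^2}{8 R_c^3}.
\]

For pixels on a regular grid, δ is a low-degree polynomial in the pixel indices i and j, so each term can be updated incrementally from pre-computed per-row and per-column tables, exactly as in classic strength reduction.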


Page 28

ASR Mathematics – Accuracy

• Approximate sqrt by the 2nd-order Taylor series

• Error increases as i or j gets larger → apply ASR per block (bounding i and j)

• Mixed precision: DP to pre-compute constants, SP for the main compute (see the sketch below)
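A hypothetical sketch of the blocked, mixed-precision structure (the 1-D toy function f(i) = sqrt(a + b·i + c·i²), names, and block size are mine, not the talk's): DP recomputes exact Taylor coefficients at each block start, then SP forward differences evaluate the quadratic with one addition per output.

    #include <math.h>
    #include <stdio.h>
    #define N     4096
    #define BLOCK 64   /* bounding the index keeps the Taylor error small */

    int main(void) {
        const double a = 1.0e6, b = 3.0, c = 0.25;   /* test coefficients */
        static float out[N];

        for (int i0 = 0; i0 < N; i0 += BLOCK) {
            /* DP pre-compute per block: f and its derivatives at i0. */
            double g  = a + b * i0 + c * (double)i0 * i0;  /* radicand   */
            double gp = b + 2.0 * c * i0;                  /* d(g)/di    */
            double f  = sqrt(g);
            double f1 = gp / (2.0 * f);                         /* f'(i0)  */
            double f2 = (2.0 * c - gp * gp / (2.0 * g)) / (2.0 * f); /* f'' */

            /* SP main compute: quadratic via forward differences. */
            float v  = (float)f;
            float dv = (float)(f1 + 0.5 * f2);
            float d2 = (float)f2;
            for (int d = 0; d < BLOCK; d++) {
                out[i0 + d] = v;
                v  += dv;
                dv += d2;
            }
        }
        printf("%.3f vs %.3f\n", out[N - 1],
               sqrt(a + b * (N - 1.0) + c * (N - 1.0) * (N - 1.0)));
        return 0;
    }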

Page 29

Accuracy and Performance Trade-offs

20 dB higher = 0.5x error; 64x64 blocks: ~2x speedup with similar accuracy

Measured on Intel® Xeon® E5-2670, 2-socket, 2.6GHz

Page 30

ASR in Ultrasound Beamforming

16x64 blocks: ~2x speedup with similar accuracy

Collaboration with GE Healthcare

Measured on Intel® Xeon® E5-2670, 2-socket, 2.6GHz

Page 31

Accuracy-Performance Trade-offs in SOI FFT

Tang, Park, Kim, and Petrov, SC’12

Page 32

Conclusion

Xeon Phi
– Many-core wide-vector coprocessor with software portability

Algorithmic Optimizations
– Low-communication algorithms: SOI FFT
– Approximation: approximate strength reduction

Future Work
– Generalization and tool support

Page 33

Acknowledgements

Peter Tang

Intel Parallel Computing Lab: Daehyun Kim, Mikhail Smelyanskiy, Ganesh Bikshandi, Karthikeyan Vaidyanathan, and Pradeep Dubey

Intel MKL Team: Vladimir Petrov and Robert Hanek

Georgia Tech Research Institute: Thomas Benson, Daniel Campbell

Reservoir Lab: Nicolas Vasilache and Richard Lethin

DARPA UHPC project*

* This research was, in part, funded by the U.S. Government under contract number HR0011-10-3-0007. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

Page 34

Q&A