Productive Parallel Programming for Intel Xeon Phi Coprocessors

Productive Parallel Programming for Intel® Xeon Phi™ Coprocessors. Bill Magro, Director and Chief Technologist, Technical Computing Software, Intel Software & Services Group


DESCRIPTION

In this video from Moabcon 2013, Bill Magro from Intel presents: Productive Parallel Programming for Intel Xeon Phi Coprocessors. Learn more at: http://www.adaptivecomputing.com/company/news-and-events/events/moabcon-2013/moabcon-2013-full-agenda/ and http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html You can watch the video of this presentation at: http://insidehpc.com/?p=36407

TRANSCRIPT

Page 1: Productive parallel programming for intel xeon phi coprocessors

Productive Parallel Programming for Intel® Xeon Phi™ Coprocessors

Bill Magro, Director and Chief Technologist, Technical Computing Software, Intel Software & Services Group

Page 2: Productive parallel programming for intel xeon phi coprocessors

Still an Insatiable Need For Computing

PetaFlop Systems of Today Are The Client And Handheld Systems 10 years Later

[Chart: Top500 performance trend and forecast, 1993 to 2029, spanning 100 MFlops to 1 ZFlops; example workloads include weather prediction, medical imaging, and genomics research. Source: www.top500.org]

Page 3: Productive parallel programming for intel xeon phi coprocessors

Approaching Exascale

Page 4: Productive parallel programming for intel xeon phi coprocessors

Some believe…

•  Virtually none of today’s hardware or software technologies can be improved or modified to reach exascale

•  A complete revolution is needed We believe…

•  Evolution of today’s technologies + hardware and software innovation can get us there

•  A systems approach – with co-design – is critical

Page 5: Productive parallel programming for intel xeon phi coprocessors

[Process technology timeline: 90 nm (2003, invented SiGe strained silicon), 65 nm (2005, 2nd-gen SiGe strained silicon), 45 nm (2007, invented gate-last high-k metal gate), 32 nm (2009, 2nd-gen gate-last high-k metal gate), 22 nm (2011, first to implement tri-gate).]

Strained silicon, high-k metal gate, tri-gate

22nm: A Revolutionary Leap in Process Technology
•  37% performance gain at low voltage*
•  >50% active power reduction at constant performance*

Moore’s Law: Alive and Well
The foundation for all computing… including Exa-Scale

Source: Intel *Compared to Intel 32nm Technology

Page 6: Productive parallel programming for intel xeon phi coprocessors

Intel® Xeon Phi™ Coprocessor [Knights Corner]: Power Efficiency

Performance per Watt of a prototype Knights Corner Cluster compared to the 2 Top Graphics Accelerated Clusters


MFLOPS/Watt (higher is better; source: www.green500.org):
•  Intel Corp prototype Knights Corner cluster: 1381 MFLOPS/Watt (Top500 #150, June 2012, 72.9 kW)
•  Nagasaki Univ., ATI Radeon: 1380 MFLOPS/Watt (Top500 #456, June 2012, 47 kW)
•  Barcelona Supercomputing Center, Nvidia Tesla 2090: 1266 MFLOPS/Watt (Top500 #177, June 2012, 81.5 kW)

Page 7: Productive parallel programming for intel xeon phi coprocessors

Myth: explicitly managed locality is in and caches are out! Reality: Caches remain the path to high performance and efficiency.

[Chart: relative bandwidth and relative bandwidth per watt for memory, L2 cache, and L1 cache; the caches deliver far higher bandwidth and bandwidth/watt than main memory.]

Page 8: Productive parallel programming for intel xeon phi coprocessors

#1 Green500 Cluster, WORLD RECORD!

“Beacon” at NICS: Intel® Xeon® processor + Intel® Xeon Phi™ coprocessor cluster, the most power-efficient system on the list
2.449 GigaFLOPS/Watt, 70.1% efficiency

Other brands and names are the property of their respective owners. Source: www.green500.org as of Nov 2012

Page 9: Productive parallel programming for intel xeon phi coprocessors

Reaching Exascale Power Goals Requires Architectural & Systems Focus

•  Memory (2x-5x)
  –  New memory interfaces (optimized memory control and xfer)
  –  Extend DRAM with non-volatile memory
•  Processor (10x-20x)
  –  Reducing data movement (functional reorganization, > 20x)
  –  Domain/core power gating and aggressive voltage scaling
•  Interconnect (2x-5x)
  –  More interconnect on package
  –  Replace long haul copper with integrated optics
•  Data Center Energy Efficiencies (10%-20%)
  –  Higher operating temperature tolerance
  –  480V to the rack and free air/water cooling efficiencies

Page 10: Productive parallel programming for intel xeon phi coprocessors

Reliability of these machines requires a systems approach

•  Transparent process migration
•  Holistic fault detection and recovery
•  Reliable end-to-end communications
•  Integrated memory in the storage layer for fast checkpoint and workflow
•  N+1-scale reliable architectures consistent with stacked memory constraints
•  System-wide power management and dynamic optimization
•  Must design for system-level debug capability

Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)

Reliability is the primary force driving next generation designs

[Charts: top system concurrency trend, ’93 to ’09, rising from roughly 1E+02 to 1E+07 (extreme parallelism); projected DRAM chip count and socket count growth, 2004 to 2016, at 0.1 failures per socket per year, with MTTI (hours) falling toward the time needed to save a global checkpoint; past the crossover point, MTTI is measured in minutes.]

Page 11: Productive parallel programming for intel xeon phi coprocessors

Foundation of Performance: Computing

Page 12: Productive parallel programming for intel xeon phi coprocessors

Intel® Xeon® processor
•  Ground-breaking real-world application performance
•  Industry-leading energy efficiency
•  Meets broad set of HPC challenges

Architecture for Discovery

Intel® Xeon Phi™ product family
•  Based on Intel® Many Integrated Core (MIC) architecture
•  Leading performance for highly parallel workloads
•  Common Intel Xeon programming model
•  Productive solution for highly-parallel computing

Page 13: Productive parallel programming for intel xeon phi coprocessors

Intel® Xeon® E5-2600 processors
•  Up to 73% performance boost vs. prior gen1 on an HPC suite of applications
•  Over 2X improvement on key industry benchmarks
•  Significantly reduce compute time on large, complex data sets with Intel® Advanced Vector Extensions
•  Integrated I/O cuts latency while adding capacity & bandwidth
•  Up to 4 channels of DDR3 1600 memory
•  Up to 8 cores, up to 20 MB cache
•  Integrated PCI Express*

1  Over previous generation Intel® processors. Intel internal estimate. For more legal information on performance forecasts go to http://www.intel.com/performance

Page 14: Productive parallel programming for intel xeon phi coprocessors

Introducing Intel® Xeon Phi™ Coprocessors: Highly-Parallel Processing for Unparalleled Discovery

Groundbreaking differences:
•  Up to 61 IA cores / 1.1 GHz / 244 threads
•  Up to 8 GB memory with up to 352 GB/s bandwidth
•  512-bit SIMD instructions
•  Linux operating system, IP addressable
•  Standard programming languages and tools

Leading to groundbreaking results:
•  Up to 1 TeraFlop/s double-precision peak performance1
•  Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server2
•  Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server3

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Notes 1, 2 & 3, see backup for system configuration details.

Page 15: Productive parallel programming for intel xeon phi coprocessors

BIG GAINS FOR SELECT APPLICATIONS

[Chart: theoretical performance as a function of fraction parallel and % vector (0% to 100%); performance climbs steeply only when code is both parallelized (scaled to many cores) and vectorized.]

* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
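To make the fraction-parallel and %-vector axes concrete, here is a minimal, hedged Amdahl-style model (my illustration, not Intel's formula): the core count, SIMD width, and the simple multiplicative model are assumptions chosen only to show why both axes matter.

    #include <cstdio>

    // Illustrative model (assumption): the parallel fraction p is sped up by
    // the core count, and the vectorized share v of that work is additionally
    // sped up by the SIMD width; serial, scalar work is left untouched.
    static double model_speedup(double p, double v, double cores, double simd)
    {
        double serial        = 1.0 - p;                  // untouched serial part
        double parallel_only = p * (1.0 - v) / cores;    // parallel, scalar
        double parallel_vec  = p * v / (cores * simd);   // parallel and vectorized
        return 1.0 / (serial + parallel_only + parallel_vec);
    }

    int main()
    {
        // Hypothetical coprocessor-like machine: 60 cores, 8-wide DP SIMD.
        const double cores = 60.0, simd = 8.0;
        std::printf("p=0.90, v=0.0 -> %.1fx\n", model_speedup(0.90, 0.0, cores, simd));
        std::printf("p=0.99, v=0.9 -> %.1fx\n", model_speedup(0.99, 0.9, cores, simd));
        return 0;
    }

With 90% parallel but scalar code the model tops out near 9x; only code that is both almost entirely parallel and mostly vectorized approaches the steep part of the curve.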

Page 16: Productive parallel programming for intel xeon phi coprocessors

Performance Potential of Intel® Xeon Phi™ Coprocessors

Page 17: Productive parallel programming for intel xeon phi coprocessors


Synthetic Benchmark Summary (Intel® MKL), higher is better:

  Benchmark             2S Intel® Xeon® processor   1 Intel® Xeon Phi™ coprocessor     Speedup
  SGEMM (GF/s)          640                         1,860 (86% efficient)              Up to 2.9X
  DGEMM (GF/s)          309                         883 (82% efficient)                Up to 2.8X
  HPLinpack (GF/s)      303                         803 (75% efficient)                Up to 2.6X
  STREAM Triad (GB/s)   79                          175 (ECC on) / 181 (ECC off)       Up to 2.2X

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel measured results as of October 26, 2012. Configuration details: please reference slide speaker notes. For more information go to http://www.intel.com/performance

Notes
1.  Intel® Xeon® Processor E5-2670 used for all: SGEMM matrix = 13824 x 13824, DGEMM matrix = 7936 x 7936, SMP Linpack matrix = 30720 x 30720
2.  Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold Release Candidate” SW stack: SGEMM matrix = 15360 x 15360, DGEMM matrix = 7680 x 7680, SMP Linpack matrix = 26872 x 28672

Page 18: Productive parallel programming for intel xeon phi coprocessors

PARALLELIZING FOR HIGH PERFORMANCE. Example: SAXPY

STARTING POINT: Typical serial code running on multi-core Intel® Xeon® processors. Current performance: 67.097 seconds.

STEP 1. OPTIMIZE CODE: Parallelize and vectorize the code and continue to run on multi-core Intel Xeon processors. 0.46 seconds (145X faster).

STEP 2. USE COPROCESSORS: Run all or part of the optimized code on Intel® Xeon Phi™ coprocessors. 0.197 seconds (2.3X faster than Step 1, 340X faster than the starting point).
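The slide reports only timings; as a concrete companion, here is a minimal, hedged SAXPY sketch (my example, not the code Intel measured) of the two steps: parallelize and vectorize with OpenMP on the host, then run the same loop on the coprocessor with the offload pragma shown on the “Offload Code Examples” slide later in the deck. The OpenMP 4.0 combined construct and the data-movement clauses are assumptions.

    #include <cstdio>
    #include <vector>

    // Step 1: parallelize across cores and vectorize the loop. The serial
    // starting point is the same loop with no pragma at all.
    void saxpy_host(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    // Step 2: run the same loop on the Intel Xeon Phi coprocessor.
    // The in/inout clauses describing data movement are illustrative.
    void saxpy_offload(int n, float a, float *x, float *y)
    {
        #pragma offload target(mic) in(x:length(n)) inout(y:length(n))
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        saxpy_host(n, 3.0f, x.data(), y.data());
        std::printf("y[0] = %f\n", y[0]);   // expect 5.0
        return 0;
    }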

Page 19: Productive parallel programming for intel xeon phi coprocessors


Performance Proof-Point: Government and Academic Research, WEATHER RESEARCH AND FORECASTING (WRF)

Speedup (higher is better):
•  2S Intel® Xeon® processor E5-2670, four-node cluster configuration: 1.0 (baseline)
•  2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW), four-node cluster configuration: 1.45

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel or third party measured results as of December, 2012. Configuration Details: Please see backup slides.. For more information go to http://www.intel.com/performance Any difference in system hardware or software design or configuration may affect actual performance.


•  Application: Weather Research and Forecasting (WRF)
•  Status: WRF v3.5 coming soon
•  Code Optimization:
  –  Approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant)
  –  Most modifications improved performance for both the host and the coprocessors
•  Performance Measurements: V3.5Pre and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast)
•  Acknowledgments: There were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies

Page 20: Productive parallel programming for intel xeon phi coprocessors


PROVEN PERFORMANCE BENEFITS: Intel® Xeon Phi™ Coprocessor

•  Up to 2.23X: Acceleware 8th-order isotropic variable velocity1 (seismic)
•  Up to 2X: Sandia National Labs MiniFE2 (finite element analysis)
•  Up to 2.05X: China Oil & Gas Geoeast pre-stack time migration3

Notes:
1.  2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW; application running 100% on coprocessor unless otherwise noted)
2.  8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (hetero)
3.  2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)

Page 21: Productive parallel programming for intel xeon phi coprocessors


PROVEN PERFORMANCE BENEFITS: Intel® Xeon Phi™ Coprocessor

•  Up to 10.75X: Monte Carlo SP2 (finance)
•  Up to 2.7X: Jefferson Lab Lattice QCD (physics)
•  Up to 7X: Black-Scholes SP2
•  1.8X speed-up: Intel Labs ray tracing3 (Embree Ray Tracing)

Notes:
1.  2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW; application running 100% on coprocessor unless otherwise noted)
2.  Includes additional FLOPS from transcendental function unit
3.  Intel measured, Oct. 2012

Page 22: Productive parallel programming for intel xeon phi coprocessors

Achieving Productive Parallelism with Intel® Xeon Phi™ Coprocessors

Page 23: Productive parallel programming for intel xeon phi coprocessors


More Cores. Wider Vectors. Performance Delivered.
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013

•  Industry-leading performance from advanced compilers
•  Comprehensive libraries
•  Parallel programming models
•  Insightful analysis tools

[Graphic: scaling performance efficiently, from serial performance to task & data parallel performance to distributed performance, across more cores (multicore to many-core, 50+ cores) and wider vectors (128 bits, 256 bits, 512 bits).]

Page 24: Productive parallel programming for intel xeon phi coprocessors


Parallel Performance Potential

•  If your performance needs are met by an Intel Xeon® processor, they will be achieved with fewer threads than on a coprocessor
•  On a coprocessor:
  –  You need more threads to achieve the same performance
  –  The same thread count can yield less performance

Intel® Xeon Phi™ excels on highly parallel applications

Page 25: Productive parallel programming for intel xeon phi coprocessors


Maximizing Parallel Performance

•  Two fundamental considerations:
  –  Scaling: Does the application already scale to the limits of Intel® Xeon® processors?
  –  Vectorization and memory usage: Does the application make good use of vectors, or is it memory-bandwidth bound?
•  If both are true for an application, then the highly parallel and power-efficient Intel Xeon Phi coprocessor is most likely to be worth evaluating.

Page 26: Productive parallel programming for intel xeon phi coprocessors


Intel® Family of Parallel Programming Models

Choice of high-performance parallel programming models, applicable to multi-core and many-core programming*:

•  Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism; open sourced and also an Intel product
•  Intel® Threading Building Blocks: widely used C++ template library for parallelism; open sourced and also an Intel product (see the sketch below)
•  Domain-specific libraries: Intel® Math Kernel Library
•  Established standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*
•  Research and development: Intel® Concurrent Collections, Offload Extensions, Intel® SPMD Parallel Compiler
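As one concrete illustration of these models, here is a minimal, hedged Intel® Threading Building Blocks sketch (my example, not from the deck): a parallel_for over a blocked range, the library's basic loop-parallelism idiom. It runs unchanged on a multi-core host or, built natively, on the coprocessor.

    #include <cstdio>
    #include <vector>
    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    int main()
    {
        const size_t n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        const float a = 3.0f;

        // TBB splits the iteration space into chunks and schedules them on a
        // worker-thread pool; the lambda body is ordinary serial code.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [&](const tbb::blocked_range<size_t> &r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    y[i] = a * x[i] + y[i];
            });

        std::printf("y[0] = %f\n", y[0]);   // expect 5.0
        return 0;
    }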

Page 27: Productive parallel programming for intel xeon phi coprocessors


Single-source approach to Multi- and Many-Core

[Diagram: a single source, built with common compilers, libraries, and parallel models, targets multicore CPUs, Intel® MIC architecture coprocessors, multicore clusters, and clusters with multicore and many-core nodes.]

“Unparalleled productivity… most of this software does not run on a GPU” - Robert Harrison, NICS, ORNL

R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011

Page 28: Productive parallel programming for intel xeon phi coprocessors


Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Intel® C/C++ and Fortran Compilers w/OpenMP
Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP

Page 29: Productive parallel programming for intel xeon phi coprocessors


Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Intel® C/C++ and Fortran Compilers w/OpenMP
Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP
Intel® Parallel Studio XE

Page 30: Productive parallel programming for intel xeon phi coprocessors


Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Intel® C/C++ and Fortran Compilers w/OpenMP
Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP
Intel® Parallel Studio XE
Intel® Trace Analyzer and Collector
Intel® MPI Library

Page 31: Productive parallel programming for intel xeon phi coprocessors


Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Intel® C/C++ and Fortran Compilers w/OpenMP
Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP
Intel® Parallel Studio XE
Intel® Trace Analyzer and Collector
Intel® MPI Library

Page 32: Productive parallel programming for intel xeon phi coprocessors

Intel® Xeon Phi™ Coprocessor: Game Changer for HPC

Build your applications on a known compute platform… and watch them take off sooner.

With restrictive special-purpose hardware: complex code porting and new learning.
With Intel® Xeon Phi™ coprocessor: familiar tools & runtimes.

“We ported millions of lines of code in only days and completed accurate runs. Unparalleled productivity… most of this software does not run on a GPU and never will.” — Robert Harrison, National Institute for Computational Sciences, Oak Ridge National Laboratory7

7 Harrison, Robert. “Opportunities and Challenges Posed by Exascale Computing—ORNL's Plans and Perspectives.” National Institute of Computational Sciences (NICS), 2011.

Page 33: Productive parallel programming for intel xeon phi coprocessors


Achieving Parallelism in Applications

IA Benefit: Wide Range of Development Options

Parallelization options (from ease of use to fine control):
•  Intel® Math Kernel Library
•  MPI*
•  Intel® Threading Building Blocks
•  Intel® Cilk™ Plus
•  OpenMP*
•  Pthreads*

Vector options (from ease of use to fine control):
•  Intel® Math Kernel Library
•  Array notation: Intel® Cilk™ Plus
•  Auto vectorization
•  Semi-auto vectorization: #pragma (vector, ivdep, simd)
•  OpenCL*
•  C/C++ vector classes (F32vec16, F64vec8)
•  Intrinsics

(A short vectorization sketch follows below.)
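To make the vector options concrete, here is a minimal, hedged sketch (my example, not from the deck) of the same loop written three ways: relying on auto-vectorization, nudging the compiler with a semi-auto pragma, and using Intel® Cilk™ Plus array notation. The #pragma simd hint and the array-notation syntax assume an Intel compiler that supports them.

    // Auto vectorization: a simple, dependence-free loop the compiler can
    // usually vectorize on its own at -O2/-O3.
    void scale_auto(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    // Semi-auto vectorization: assert vectorization is safe so the compiler
    // proceeds even when it cannot prove independence (see also #pragma ivdep).
    void scale_pragma(int n, float a, const float *x, float *y)
    {
        #pragma simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    // Intel Cilk Plus array notation: express the operation on array sections
    // and let the compiler generate the vector code.
    void scale_array_notation(int n, float a, const float *x, float *y)
    {
        y[0:n] = a * x[0:n];
    }

Further down the list, the same loop could be written with C/C++ vector classes or intrinsics for full control at the cost of portability.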

Page 34: Productive parallel programming for intel xeon phi coprocessors


Spectrum of Programming Models and Mindsets

Range of models to meet application needs, spanning multi-core centric to many-core centric:

•  Multi-Core Hosted: general purpose serial and parallel computing; the whole application (Main(), Foo(), MPI_*()) runs on the CPU
•  Offload: codes with highly-parallel phases; Main() and MPI_*() stay on the CPU while hot functions (Foo()) run on the coprocessor
•  Symmetric: codes with balanced needs; complete MPI programs run on both the CPU and the coprocessor
•  Many-Core Hosted: highly-parallel codes; the whole application runs on the coprocessor

Page 35: Productive parallel programming for intel xeon phi coprocessors


Operating Environment View

A flexible, familiar, compatible operating environment:

•  Host: Intel® Xeon® processor running Linux (Linux Standard Base, “LSB”) with the usual platform runtimes, ABI, and file I/O
•  Coprocessor (MIC): Intel® Xeon Phi™ coprocessor running Linux with Linux Standard Base services: IP, SSH, NFS
•  Host and coprocessor are connected over PCIe and communicate via SCIF and sockets / OFED

Page 36: Productive parallel programming for intel xeon phi coprocessors


Programming View: Same Parallel Models for Processor and Co-processor

•  The same intra-node parallel models are available on the host (Intel® Xeon® processor) and on the Intel® Xeon Phi™ coprocessor (MIC): Pthreads, OpenMP, Cilk Plus, TBB, MKL (including automatic offload, AO-MKL), OpenCL, and C++/FTN
•  C++/FTN Language Extensions for Offload move work from host to coprocessor for node performance and offload
•  MPI provides intra- and inter-node parallelism on both sides
•  Host and coprocessor communicate over PCIe via SCIF / OFED / IP

(A hedged MKL automatic-offload sketch follows below.)
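As an illustration of the MKL automatic-offload path named above, here is a minimal, hedged sketch (my example): an ordinary DGEMM call through MKL. With automatic offload, MKL itself may split a sufficiently large multiply between host and coprocessor; enabling it via the MKL_MIC_ENABLE environment variable (or an mkl_mic_enable() call) is my recollection of the tooling of that era and should be checked against the MKL documentation.

    #include <cstdio>
    #include <vector>
    #include <mkl.h>

    int main()
    {
        const int n = 4096;   // large enough that offload could pay off (assumption)
        std::vector<double> A((size_t)n * n, 1.0),
                            B((size_t)n * n, 2.0),
                            C((size_t)n * n, 0.0);

        // Plain MKL call, no offload pragmas in user code. With automatic
        // offload enabled (e.g. MKL_MIC_ENABLE=1 in the environment), MKL
        // decides whether to run on the host, the coprocessor, or both.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

        std::printf("C[0] = %f\n", C[0]);   // expect 2.0 * n = 8192
        return 0;
    }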

Page 37: Productive parallel programming for intel xeon phi coprocessors


(Recap of the Spectrum of Programming Models and Mindsets from page 34: multi-core hosted, offload, symmetric, and many-core hosted.)

Page 38: Productive parallel programming for intel xeon phi coprocessors

Programming Intel® Xeon Phi™ based Systems (MPI+Offload)

•  MPI ranks on Intel® Xeon® processors (only)
•  All messages into/out of the Xeon processors
•  Offload models used to accelerate MPI ranks
•  TBB, OpenMP*, Cilk Plus, Pthreads within the coprocessor
•  Homogeneous network of hybrid nodes

[Diagram: four nodes, each pairing a CPU with a MIC coprocessor; MPI moves data between the CPUs over the network, and each rank offloads work to its local coprocessor.]

(A hedged MPI + offload sketch follows below.)
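A minimal, hedged sketch of this model (my example, combining the standard MPI API with the offload pragma shown on the next code-examples slide): each MPI rank runs on a host CPU and offloads its local chunk of work to the coprocessor on its node. The chunk size and the data-movement clauses are illustrative assumptions.

    #include <cstdio>
    #include <vector>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;                 // per-rank chunk (illustrative)
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        float *xp = x.data(), *yp = y.data();
        const float a = 3.0f;

        // Each host-side MPI rank accelerates its local work by offloading
        // the loop to the coprocessor attached to its node.
        #pragma offload target(mic) in(xp:length(n)) inout(yp:length(n))
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            yp[i] = a * xp[i] + yp[i];

        // MPI communication stays between the host processors only.
        float local = yp[0], total = 0.0f;
        MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("sum of y[0] over %d ranks = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }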

Page 39: Productive parallel programming for intel xeon phi coprocessors


Offload Code Examples

•  C/C++ Offload Pragma:

    #pragma offload target(mic)
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        float t = (float)((i + 0.5) / count);
        pi += 4.0 / (1.0 + t*t);
    }
    pi /= count;

•  Function Offload Example:

    #pragma offload target(mic) in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        inout(C:length(matrix_elements))
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

•  Fortran Offload Directive:

    !dir$ omp offload target(mic)
    !$omp parallel do
    do i = 1, 10
        A(i) = B(i) * C(i)
    enddo

•  C/C++ Language Extension:

    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();

Page 40: Productive parallel programming for intel xeon phi coprocessors


(Recap of the Spectrum of Programming Models and Mindsets from page 34: multi-core hosted, offload, symmetric, and many-core hosted.)

Page 41: Productive parallel programming for intel xeon phi coprocessors

Programming Intel® Xeon Phi™ based Systems (MIC Native)

•  MPI ranks on Intel MIC (only)
•  All messages into/out of the coprocessor
•  TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
•  Programmed as a homogeneous network of many-core CPUs

[Diagram: four nodes, each pairing a CPU with a MIC coprocessor; the MPI ranks and their data live on the coprocessors, communicating over the network.]

(A hedged native build-and-run sketch follows below.)
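A minimal, hedged sketch of the native workflow (my example, not from the deck): ordinary MPI + OpenMP source, no offload pragmas, cross-compiled for the coprocessor and launched with ranks placed on the card. The compiler and launch options in the comments (-mmic, I_MPI_MIC, -host mic0) are my recollection of the Intel tooling of that era; verify them against the Intel MPI and compiler documentation for your versions.

    // Hedged build/run assumptions (verify before use):
    //   mpiicpc -mmic -openmp native_hello.cpp -o native_hello.mic
    //   export I_MPI_MIC=1
    //   mpirun -n 4 -host mic0 ./native_hello.mic
    #include <cstdio>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, len = 0;
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(name, &len);

        // Each rank reports where it runs and how many OpenMP threads it sees;
        // a 61-core coprocessor exposes up to 244 hardware threads.
        #pragma omp parallel
        {
            #pragma omp single
            std::printf("rank %d on %s with %d threads\n",
                        rank, name, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }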

Page 42: Productive parallel programming for intel xeon phi coprocessors


(Recap of the Spectrum of Programming Models and Mindsets from page 34: multi-core hosted, offload, symmetric, and many-core hosted.)

Page 43: Productive parallel programming for intel xeon phi coprocessors

Programming Intel® MIC-based Systems (Symmetric)

•  MPI ranks on the coprocessors and the Intel® Xeon® processors
•  Messages to/from any core
•  TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
•  Programmed as a heterogeneous network of homogeneous nodes

[Diagram: four nodes, each pairing a CPU with a MIC coprocessor; MPI ranks and their data live on both the CPUs and the coprocessors, communicating over the network.]

(A hedged symmetric-mode sketch follows below.)
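A minimal, hedged sketch of the symmetric idea (my example): the same MPI program, built once per architecture, runs ranks on hosts and coprocessors alike, and because the two sides differ in per-rank speed, each rank picks its share of the work at run time. Detecting coprocessor ranks by a "mic" substring in the processor name and the 2:1 weighting are illustrative assumptions.

    #include <cstdio>
    #include <cstring>
    #include <numeric>
    #include <vector>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1, len = 0;
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        // Assumption: coprocessor-side ranks report a host name containing "mic";
        // give host ranks twice the weight of coprocessor ranks (illustrative).
        int weight = std::strstr(name, "mic") ? 1 : 2;

        // Share all weights so every rank can compute its own slice of N items.
        std::vector<int> weights(size);
        MPI_Allgather(&weight, 1, MPI_INT, weights.data(), 1, MPI_INT, MPI_COMM_WORLD);

        const long N = 1000000;
        long total_w = std::accumulate(weights.begin(), weights.end(), 0L);
        long before  = std::accumulate(weights.begin(), weights.begin() + rank, 0L);
        long begin   = N * before / total_w;
        long end     = N * (before + weight) / total_w;

        std::printf("rank %d on %s handles items [%ld, %ld)\n", rank, name, begin, end);

        MPI_Finalize();
        return 0;
    }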

Page 44: Productive parallel programming for intel xeon phi coprocessors


(Recap of the Spectrum of Programming Models and Mindsets from page 34: multi-core hosted, offload, symmetric, and many-core hosted.)

Page 45: Productive parallel programming for intel xeon phi coprocessors


A GROWING ECOSYSTEM: Developing today on Intel® Xeon Phi™ coprocessors


Page 46: Productive parallel programming for intel xeon phi coprocessors


Software, Drivers, Tools & Online Resources

•  Tools & software downloads
•  Getting-started development guides
•  Video workshops, tutorials, & events
•  Code samples & case studies
•  Articles, forums, & blogs
•  Associated product links

http://software.intel.com/mic-developer

Page 47: Productive parallel programming for intel xeon phi coprocessors

Keys to Productive Parallel Performance

•  Determine the best platform target for your application: Intel® Xeon® processors or Intel® Xeon Phi™ coprocessors, or both
•  Choose the right Xeon-centric or MIC-centric model for your application
•  Vectorize your application
•  Parallelize your application
  –  With MPI (or other multi-process model)
  –  With threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.)
  –  Go asynchronous: overlap computation and communication (see the sketch below)
•  Maintain unified source code for CPU and coprocessors
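A minimal, hedged sketch of the “go asynchronous” advice (my example): Intel's offload pragmas provide signal/wait clauses so the host can keep computing, or communicating via MPI, while the coprocessor works. The target attribute, signal/wait clauses, and the offload_wait pragma follow Intel's Language Extensions for Offload as documented at the time; treat the exact spelling as an assumption to verify against your compiler's documentation.

    #include <cstdio>
    #include <vector>

    // Work to run on the coprocessor; the attribute also keeps a host fallback.
    __attribute__((target(mic)))
    void heavy_kernel(float *y, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = y[i] * y[i] + 1.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        std::vector<float> y(n, 2.0f);
        float *yp = y.data();
        int tag = 0;   // tag identifying the asynchronous offload

        // Start the offload and return immediately (signal makes it asynchronous).
        #pragma offload target(mic:0) inout(yp:length(n)) signal(&tag)
        heavy_kernel(yp, n);

        // ... overlapping host-side computation or MPI communication goes here ...

        // Block until the offload tagged above has completed.
        #pragma offload_wait target(mic:0) wait(&tag)

        std::printf("y[0] = %f\n", y[0]);   // expect 5.0
        return 0;
    }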

Page 48: Productive parallel programming for intel xeon phi coprocessors


Page 49: Productive parallel programming for intel xeon phi coprocessors


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2012 , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Xeon Phi logo, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

intel.com/software/products

Legal Disclaimer & Optimization Notice

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers

Page 50: Productive parallel programming for intel xeon phi coprocessors


This slide MUST be used with any slides removed from this presentation

Legal Disclaimers

•  All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

•  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number

•  Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

•  Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization

•  No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/security

•  Requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost

•  Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. AES-NI is available on select Intel® processors. For availability, consult your reseller or system manufacturer. For more information, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/

•  Intel product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). No exemptions required

•  Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below 900ppm bromine and 900ppm chlorine.

•  Intel, Intel Xeon, Intel Core microarchitecture, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

•  Copyright © 2011, Intel Corporation. All rights reserved.


Page 51: Productive parallel programming for intel xeon phi coprocessors


Legal Disclaimers: Performance

•  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm.
•  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
•  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
•  SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.
•  TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information.
•  SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See http://www.sap.com/benchmark for more information.
•  INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
•  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

This slide MUST be used with any slides with performance data removed from this presentation

Page 52: Productive parallel programming for intel xeon phi coprocessors


WRF Configuration (Backup)

•  Measured by Intel 3/27/2013 (Endeavor Cluster)
•  Runs in the symmetric model
•  Citation: WRF graphic and detail: John Michalakes, 11/12/12. Hardware: 2-socket server with Intel® Xeon® processor E5-2670 (8C, 2.6 GHz, 115W), each node equipped with one Intel® Xeon Phi™ coprocessor (SE10X B1, 61 cores, 1.1 GHz, 8 GB @ 5.5 GT/s), in an 8-node FDR cluster. WRF is available from the US National Center for Atmospheric Research in Boulder, Colorado
•  It is available from http://www.wrf-model.org/
•  All KNC optimizations are in the V3.5 svn today
•  Results obtained under MPSS 3552, Compiler rev 146, MPI rev 30 on SE10X B1 KNC (61c, 1.1 GHz, 5.5 GT/s)
•  WRF CONUS2.5km workload available from www.mmm.ucar.edu/wrf/WG2/bench/
•  Performance comparison is based upon average timestep; we ignore initialization and post-simulation file operations.