
COMPUTE | STORE | ANALYZE

Introducing Cray Reveal

A tool for accelerating directive development

[email protected]

9/24/2014

Who are Cray?

Since Its Founding, Cray Has Maintained a Single Focus on Supercomputing

[Timeline: 1970–2010]

Copyright 2013 Cray Inc.

Who are Cray?

● We are “The Supercomputer” company
● Long history of producing some of the fastest computers in the world
● Cray has built some of the largest systems in the world:
  ● 3 of the Top 10 systems in the world (as measured by the Top 500)
  ● 3 of the Top 5 systems in the UK (ECMWF + ARCHER)
● “High Productivity Computing” via hardware and software
● Products for “Big Data” as well as HPC:
  ● High-capacity/high-bandwidth storage via Sonexion Lustre file systems
  ● Specialist graph analytics hardware available via the Urika platform

The Supercomputing Company

Diverse products to meet the needs of the wider market

[Timeline: 2011–2013]

XC30 Compute Blade


What has Multicore ever done for us?



What about the performance?

● 1 GF – 1988: Cray Y-MP; 8 processors (static finite element analysis)
● 1 TF – 1998: Cray T3E; 1,024 processors (modeling of metallic magnet atoms)
● 1 PF – 2008: Cray XT5; 150,000 processors (superconductive materials)
● 1 EF – ~2018: Cray; ~10,000,000 processors


What about the concurrency?

● Not just from the number of cores
● Longer vector lengths – probably fixed length
● Systems made from different types of “processors” – hybrids of scalar and vectors
● Requires new methods of programming to fully exploit the hardware

Average Number of Processor Cores per Supercomputer (Top20 of Top500):

1993: 202      1999: 1,644    2005: 10,073
1994: 408      2000: 1,847    2006: 16,316
1995: 722      2001: 2,230    2007: 20,971
1996: 808      2002: 2,827    2008: 47,309
1997: 1,245    2003: 3,093    2009: 79,949
1998: 1,073    2004: 3,518    2010: 121,306

Source: www.top500.org

23 October 2012 AMMW03


What about the energy?

Which requires more energy?

Performing a 64-bit floating-point FMA:
  893,500.288914668 × 43.90230564772498 = 39,226,722.78026233027699
  39,226,722.78026233027699 + 2.02789331400154 = 39,226,724.80815564

Or moving the three 64-bit operands 20 mm across the die: the move takes over 3× the energy! And loading the data from off chip takes > 10× more yet.

● Multicore supplies many flops for free.
● Enhancing data locality is critical for energy efficiency.


Multicore Challenges at Exascale

Power

• Traditional voltage scaling is over

• Power now a major design constraint

• Cost of ownership

• Driving significant changes in architecture

Concurrency

• A billion operations per clock

• Billions of refs in flight at all times

• Will require huge problems

• Need to exploit all available parallelism

Programming Difficulty

• Concurrency and new micro-architectures will significantly complicate software

• Need to hide this complexity from the users

Resiliency

• Many more components

• Components getting less reliable

• Checkpoint bandwidth not scaling

• Impacts both systems and storage



The Hybrid Approach

● The majority of HPC parallelism is distributed memory:
  ● applications written using message passing (e.g. MPI)
  ● or single-sided RDMA (e.g. SHMEM)
● Mapped to multicore hardware as multiple process images:
  ● potentially limited scalability as the number of images increases
  ● memory overheads increase per process
  ● time and energy costs of communication increase
  ● potentially difficult to load balance, especially dynamically
● Partitioned Global Address Space (PGAS) models offer a halfway house:
  ● e.g. UPC, Fortran 2008 coarrays, Titanium
  ● may require a complete code rewrite
● Developers are looking to shared memory models:
  ● directive-based shared memory parallelisation like OpenMP
  ● allows applications to be retro-fitted


The Directive Hybridisation Cycle

[Diagram: a cycle of tools]
  Profile – Cray Performance Analysis Toolkit (CrayPAT)
  → Cray Reveal
  → Annotate – e.g. OpenMP
  → Evaluate/Debug – ATP, STAT, FTD, Totalview
  → (back to Profile)


Reveal

Analysis and code restructuring assistant…

● Uses both the performance toolset and CCE’s program library functionality to provide static and runtime analysis information
● Assists the user in the code optimization phase by correlating source code with analysis, helping to identify which areas are key candidates for optimization

Key Features

● Annotated source code with compiler optimization information
  • Provides feedback on critical dependencies that prevent optimizations
● Scoping analysis
  • Identifies shared, private and ambiguous arrays
  • Allows the user to privatize ambiguous arrays
  • Allows the user to override dependency analysis
● Source code navigation
  • Uses performance data collected through CrayPAT


Reveal with Loop Work Estimates

CScADS 2012 Cray Inc.


Visualize CCE’s Loopmark with Performance Profile

[Screenshot: performance feedback, loopmark and optimization annotations, and compiler feedback shown alongside the source]

Visualize CCE’s Loopmark with Performance Profile (2)

[Screenshot: integrated message ‘explain support’]


View Pseudo Code for Inlined Functions

[Screenshot: inlined call sites marked; expand to see pseudo code]


Scoping Assistance – Review Scoping Results

[Screenshot]
● Loops with scoping information are highlighted – red needs user assistance
● User addresses parallelization issues for unresolved variables
● Parallelization inhibitor messages are provided to assist the user with analysis


Scoping Assistance – User Resolves Issues

[Screenshot]
● Click on a variable to view all occurrences in the loop
● Use Reveal’s OpenMP parallelization tips


Scoping Assistance – Generate Directive

[Screenshot]
● Automatically generate an OpenMP directive
● Reveal generates an example OpenMP directive