C O M P U T E | S T O R E | A N A L Y Z E
Introducing Cray Reveal
A tool for accelerating directive development
9/24/2014
Who are Cray?
Since Its Founding, Cray Has Maintained a Single Focus on Supercomputing
[Timeline figure: 1970, 1980, 1990, 2000, 2010]
Copyright 2013 Cray Inc.
Who are Cray
● We are “The Supercomputer” company
  ● Long history of producing some of the fastest computers in the world.
● Cray have built some of the largest systems in the world.
  ● 3 of Top 10 systems in the world (as measured by Top 500)
  ● 3 of the Top 5 systems in the UK (ECMWF + ARCHER)
● “High Productivity Computing” via hardware and software
● Products for “Big Data” as well as HPC
  ● High capacity/high bandwidth storage via Sonexion Lustre file systems
  ● Specialist graph analytics hardware available via Urika platform.
The Supercomputing Company
Diverse products to meet the needs of the wider market
[Product timeline figure: 2011, 2012, 2013]
Who are our customers?
What about the performance?
1 GF – 1988: Cray Y-MP; 8 Processors
• Static finite element analysis
1 TF – 1998: Cray T3E; 1,024 Processors
• Modeling of metallic magnet atoms
1 PF – 2008: Cray XT5; 150,000 Processors
• Superconductive materials
1 EF – ~2018: Cray; ~10,000,000 Processors
What about the concurrency?
● Not just from the number of cores
● Longer vector lengths
● Probably fixed length
● Systems made from different types of “processors”
● Hybrids of scalar and vectors
● Requires new methods of programming to fully exploit the hardware
Average Number of Processor Cores per Supercomputer (Top20 of Top500)
Year   Cores
1993   202
1994   408
1995   722
1996   808
1997   1,245
1998   1,073
1999   1,644
2000   1,847
2001   2,230
2002   2,827
2003   3,093
2004   3,518
2005   10,073
2006   16,316
2007   20,971
2008   47,309
2009   79,949
2010   121,306
Source: www.top500.org
23 October 2012 AMMW03 9
What about the energy?
Which requires more energy?
Performing a 64-bit floating-point FMA:
893,500.288914668 x 43.90230564772498
= 39,226,722.78026233027699
+ 2.02789331400154
= 39,226,724.80815564
Or moving the three 64-bit operands 20 mm across the die:
This one takes over 3x the energy!
And loading the data from off chip takes > 10x more yet
● Multicore supplies many flops for free.
● Enhancing data locality is critical for energy efficiency
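The multiply-add quoted on this slide can be checked with a few lines of C. This is only a minimal sketch: the helper name `multiply_add` is illustrative and not from the slides, and the plain expression `a * b + c` may or may not be contracted into a hardware FMA instruction by a given compiler. C99 also provides `fma()` in `<math.h>` for an explicitly fused operation with a single rounding.

```c
/* Hypothetical helper (not from the slides): one 64-bit multiply-add,
 * the operation whose energy cost the slide compares with data movement. */
double multiply_add(double a, double b, double c)
{
    /* Three 64-bit operands in, one 64-bit result out. */
    return a * b + c;
}
```

Calling `multiply_add(893500.288914668, 43.90230564772498, 2.02789331400154)` should reproduce the slide's result to within double-precision rounding.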
Multicore Challenges at Exascale
Power
• Traditional voltage scaling is over
• Power now a major design constraint
• Cost of ownership
• Driving significant changes in architecture
Concurrency
• A billion operations per clock
• Billions of memory references in flight at all times
• Will require huge problems
• Need to exploit all available parallelism
Programming Difficulty
• Concurrency and new micro-architectures will significantly complicate software
• Need to hide this complexity from the users
Resiliency
• Many more components
• Components getting less reliable
• Checkpoint bandwidth not scaling
• Impacts both systems and storage
The Hybrid Approach
● Majority of HPC parallelism is distributed memory, e.g.
  ● Applications written using Message Passing (e.g. MPI)
  ● or single-sided RDMA (e.g. SHMEM)
● Mapped to multicore hardware as multiple process images
  ● Potentially limited scalability as the number of images increases
  ● Memory overheads increase per process
  ● Time and energy costs of communication increase
  ● Potentially difficult to load balance, especially dynamically
● Partitioned Global Address Space (PGAS) models offer a halfway house
  ● E.g. UPC, Fortran 2008 Coarrays, Titanium
  ● May require complete code rewrite
● Developers are looking to shared memory models
  ● Directive shared memory parallelisation like OpenMP
  ● Allows applications to be retro-fitted
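As an illustration of what retro-fitting with directives can look like, here is a minimal sketch in C. The function name `dot` and the loop are assumptions made for the example, not code from the presentation. The serial loop is left unchanged and a single OpenMP directive is added above it. In a hybrid application a routine like this would typically run inside each MPI process, but the message-passing layer is omitted here.

```c
/* Hypothetical kernel (not from the slides): a serial dot product
 * retro-fitted with one OpenMP directive. Without OpenMP enabled the
 * pragma is ignored and the loop runs serially with the same result. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;

    /* Each thread accumulates a private partial sum; the partial sums
     * are combined into 'sum' when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the directive is a pragma, the same source still builds as serial code when OpenMP is not enabled. That property is what makes this style suitable for incremental adoption.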
The Directive Hybridisation Cycle
[Cycle diagram with the following stages and tools:]
● Annotate: e.g. OpenMP
● Evaluate/Debug: ATP, STAT, FTD, Totalview
● Profile: Cray Performance Analysis Toolkit (CrayPAT)
● Cray Reveal
Reveal
Analysis and code restructuring assistant…
Uses both the performance toolset and CCE’s program library functionality to provide static and runtime analysis information
Assists user with the code optimization phase by correlating source code with analysis to help identify which areas are key candidates for optimization
Key Features
Annotated source code with compiler optimization information
• Provides feedback on critical dependencies that prevent optimizations
Scoping analysis
• Identifies shared, private and ambiguous arrays
• Allows user to privatize ambiguous arrays
• Allows user to override dependency analysis
Source code navigation
• Uses performance data collected through CrayPAT
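To make the scoping terminology above concrete, here is a minimal sketch in C of a loop with both shared and private data. The function `reverse_rows`, the constant `NCOLS` and the `scratch` array are hypothetical names chosen for illustration. The sketch does not show Reveal's output. It only shows the kind of explicit scoping a user ends up with once every variable has been classified.

```c
enum { NCOLS = 4 };  /* hypothetical fixed row length for the example */

/* Hypothetical kernel (not from the slides): reverse each row of a
 * row-major matrix via a temporary buffer. */
void reverse_rows(int nrows, const double *in, double *out)
{
    double scratch[NCOLS];  /* fully rewritten in every iteration */

    /* default(none) forces every variable to be scoped explicitly:
     *   shared : nrows, in, out (read-only, or written at disjoint indices)
     *   private: scratch (each thread needs its own copy, otherwise
     *            iterations would overwrite each other's data)
     * Loop counters declared inside the loops are private automatically. */
    #pragma omp parallel for default(none) shared(nrows, in, out) private(scratch)
    for (int i = 0; i < nrows; ++i) {
        for (int j = 0; j < NCOLS; ++j) {
            scratch[j] = in[i * NCOLS + j];
        }
        for (int j = 0; j < NCOLS; ++j) {
            out[i * NCOLS + j] = scratch[NCOLS - 1 - j];
        }
    }
}
```

A temporary array such as `scratch` is a typical case where automatic analysis may be unable to prove the correct scope. The programmer, who knows that each iteration overwrites the buffer before reading it, can then privatize it by hand.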
Visualize CCE’s Loopmark with Performance Profile
Performance feedback
Loopmark and optimization annotations
Compiler feedback
Visualize CCE’s Loopmark with Performance Profile (2)
Integrated message ‘explain support’
View Pseudo Code for Inlined Functions
Inlined call sites marked
Expand to see pseudo code
Scoping Assistance – Review Scoping Results
User addresses parallelization issues for unresolved variables
Loops with scoping information are highlighted – red needs user assistance
Parallelization inhibitor messages are provided to assist user with analysis
Scoping Assistance – User Resolves Issues
Click on variable to view all occurrences in loop
Use Reveal’s OpenMP parallelization tips