recent accomplishments/use cases

Recent Accomplishments/Use Cases

4

Static Compiler AnalysisCy Chan and Didem Unat

(joint work with ExaCT)

• Developed on top of the ROSE framework• For each module/subroutine in input code

– Collect function level information• Local and non-local variables• Data arrays: read-only, write-only, read-write• Communications between processes

– Collect loop level information• Iteration range and stride• Arithmetic operation counts• State variables: locals, parameters, arguments• Data arrays: access types and patterns

• Currently takes Fortran codes and outputs an XML

5

Proxy Applications

• SMC is compressible Navier Stokes solver– Uses the same discretization approach as the

petascale application code S3D– Simplified set of boundary conditions– Uses eight-order finite difference approximation in

space and a low-storage Runge-Kutta algorithm in time.

• CNS is compressible Navier Stokes– Does not include chemical species – Uses constant coefficient transport properties

6

Even though transcendentals and division ops might be low in count, they can dominate the CPU time7

SMC code with 53 species

1 10 100 1000 100001

10

100

1000

10000

9 Species21 Species53 Species71 Species107 Species

Variable Rank

Ac

ce

ss

es

x86 has 16 FP named registers!

Allocate to registers

Leave in L1 cache

Register Requirements for SMC

Chemistry FP State Variables by Rank

8

Impact of Software Optimization on CNS Cache/Bandwidth Trade-Off

2 20 200 2000 200000.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

Baseline + Best Blocking + Best Fusion

Cache Size (kB)

By

tes

pe

r F

lop

Up to 45% improvement for blocking and 90% for

fusion

9

SMC Performance ProjectionsNeither software optimizations alone nor hardware optimizations alone

will not get us to the exascale, we have to apply both.

10

Neither software optimizations alone nor hardware optimizations alone will not get us to the exascale, we have to apply both.

9 21 53 71 1070

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5Estimated Performance Improvements

+Fast NIC (400 GB/s)

+Fast-exp

+Fast-div

+Fast memory (4 TB/s)

+Loop fusion

+Cache blocking

Baseline

Number of Species

Te

rafl

op

s

U.iryn

Q.qwU.im

y

Unew.ir

yn

Unew.ie

ne

Unew.im

z

Ddiag

.ndp

y.n ux wz uy vz

Q.qrh

o

Fdif.im

y

Fdif.ir

ho

Fhyp.

imx

Fhyp.

irho

Hg.im

x

Hg.iry

n

Uprim

e.im

y

Uprim

e.iry

nQ.q

e0

1

2

3

4

5

6

7

8

9

10

0.00%

0.10%

0.20%

0.30%

0.40%

0.50%

0.60%

0.70%

0.80%

0.90%Read/Write Ratio Write Access Rate

Re

ad

/Wri

te R

ati

o W

rite A

cc

es

s R

ate

• Ranking arrays by the read/write ratios and write access rates• NVRAM is not friendly to write references

11

U.iryn

Q.qwU.im

y

Unew.ir

yn

Unew.ie

ne

Unew.im

z

Ddiag

.ndp

y.n ux wz uy vz

Q.qrh

o

Fdif.im

y

Fdif.ir

ho

Fhyp.

imx

Fhyp.

irho

Hg.im

x

Hg.iry

n

Uprim

e.im

y

Uprim

e.iry

nQ.q

e0

1

2

3

4

5

6

7

8

9

10

0.00%

0.10%

0.20%

0.30%

0.40%

0.50%

0.60%

0.70%

0.80%

0.90%Read/Write Ratio Write Access Rate

Re

ad

/Wri

te R

ati

o W

rite A

cc

es

s R

ate

ReadWrite Ratio > 5, NVRAM percentage

35%

• Using RW ratio might be misleading. RW ratio > 5 goes to NVRAM (only 35% of the data)

• If a write access rate of <=0.11% is chosen, then 75% of memory footprint qualifies for NVRAM

– Roughly 75% idle power savings but the dynamic energy will go up by a factor of 4 to 40– Need power simulations for more accurate results

Write Access Rate <= 0.11%,

NVRAM percentage 75%

12

Auto-Skeletonization(LLNL and Sandia)

• GOAL: Generate a reduced program that retains the performance characteristics of the original.

• Reduced program ideal for use in simulation environment : remove computations that aren’t relevant with respect to performance aspect of interest (e.g., message passing performance).

• Our approach uses static analysis and code transformation via ROSE with user guidance for fine tuning skeletonization process.

13

Analysis and Simulation Components

Static analysis

• Static single assignment form provided by ROSE defines data dependency information for program.

• Resolution of interprocedural data dependencies based on transitive closure of program call graph.

• Data dependency graph is labeled based on role data plays with respect to an API (such as MPI).– Roles include “payload”, “comm. topology”, etc.– Allows dependencies to be treated differently depending on how they

affect the API calls.

• Annotation guidance to inject user-knowledge not available purely from static analysis.– Many properties are data dependent. E.g.: expected iteration counts

for iterative solvers. – Annotations were important in applying skeletonizer to real programs.

Preliminary results

• Three case studies in paper: 2D FFT; 2D Jacobi; Sandia Mantevo HPCCG benchmark.

• Summary: – All generated skeletons were smaller and ran faster than

original code under simulation with the SST/Macro simulator.– Skeletons matched hand-coded skeletons for cases where

these existed (HPCCG, FFT).

FFT HPCCG Jacobi

Current work

• Whole-program dependency analysis for programs with many distinct compilation units.

• Working with pre-processor and conditional compilation.– NAS benchmarks is complicated by problem size

selection occurring at compiler time. Would like to create one skeleton, not one per problem size.

• Continuation of memory footprint skeleton generator initiated in 2012. Allows a second performance dimension to be studied in addition to message passing performance.

Mixed Model Simulation(Sandia & LBNL)

Accomplishment: Create flexible, modular, interoperable simulation environment using SystemC Industry standard– More agile environment to rapidly configure experiments to

answer questions posed by vendors and CoDesign centers– Enables accurate multiscale evaulation of energy costs for data

movement

18

NVRAM Simulation(LBNL and U. Penn)

Accomplishment: Created validated simulator for NVRAM components using Micron and Samsung data– Enable study of in-situ I/O– Motivated serve SDMA work in ExaCT Codesign Center

19

NVRAM Simulation for Advanced PRAM Architecture Studies

Architecture study of multi-layer cross-point RRAM architecture (In collaboration with HP Labs and Adesto)

• High-density (shared wordlines and bitlines reduce area cost)

• Parallel data access among multiple layers• Bi-group operation scheme

Identified new memory organization for RRAM that overcomes traditional sources of failure and demonstrated its effectiveness with architectural simulation

Paper accepted for publication in Usenix FAST13 conference on Design of Large Scale Storage

Arch Study of Multi-layer Cross-point RRAM Architecture• High-density (shared wordlines & bitlines reduce area cost)• Parallel data access among multiple layers• Bi-group operation scheme

20

Proposed multi-slab RRAM Architecture

Simulation of conventional single slab RRAM• Limited cell read

parallelism• Spends majority of time on

cell reads

Simulation of multi-slab RRAM• Drastically improves

internal bandwidth• Saturates bus channel

and avoids common• Failure modes for

memresistive technology

Fault Injection for Resilience Research(LLNL and LBNL)

• Accomplishment: Create flexible hardware simulation environment to support fault resilience research– Helps ROSE Triple Modular Redundancy (TMR) Research– Will support emerging Resilience research projects

21

Thank You!

Please visit

http://www.nersc.gov/projects/CoDEx

For more information!

22

http://www.nersc.gov/projects/CoDEx

recent accomplishments/use cases

Documents

hardware optimizations

write access rate

rw ratio

access types

argumentsdata arrays

access ratesnvram

ranking arrays

references10readwrite