super 1 bob lucas university of southern california sept. 23, 2011 science pipeline allen d. malony...

SUPER1

Bob Lucas

University of Southern California

Sept. 23, 2011

Science Pipeline

Allen D. Malony, University of Oregon

May 6, 2014

Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research

SUPER2

Fundamental Objectives

• SUPER funnels the rich intellectual products born of a history of research and development in performance areas into an effective performance engineering center of mass for the SciDAC program

• SUPER pulls from prior investments by ASCR and others the technology and expertise that past efforts produced, especially with respect to methodologies, tools, and integration across performance engineering areas
  – measurement, analysis, modeling
  – program analysis, optimization and tuning
  – resilience

• SUPER focuses on integration of expertise for addressing performance engineering problems across the SciDAC landscape, leveraging the robust performance tools available

SUPER3

Pipeline to Tools/Technology Integration and Application

[Figure: pipeline diagram. Labels: DOE funding; other funding; research areas (performance, modeling, reliability, autotuning, optimization, resilience, energy, code analysis); tools / technologies (TAU, mpiP, PBound, Active Harmony, CHiLL, Roofline, PAPI, PEBIL, PSiNtracer, GPTL, RCRToolkit, ROSE, Orio); center of mass for performance engineering; end-to-end integration; SciDAC applications.]

SUPER4

Performance Engineering Tools/Tech Integration

• SUPER focuses on integrating developed tools and technologies to build enhanced capabilities

SUPER5

End-to-end Performance Optimization

• SUPER is establishing processes for applying integrated tools for end-to-end optimization

SUPER6

Tools and Technologies

• Performance
  – TAU Performance System, PAPI, mpiP, GPTL
• Power / Energy
  – PEBIL, PSiNtracer
• Autotuning
  – Active Harmony, CHiLL, Orio
• Resilience and source analysis
• Modeling
  – PBound, Roofline
• Optimization

SUPER7

TAU Performance System

• Tuning and Analysis Utilities (20+ year project)
• Performance problem solving framework for HPC
• Integrated performance toolkit
  – Multi-level performance instrumentation
  – Flexible and configurable performance measurement
  – Widely-ported performance profiling / tracing system
  – Performance data management and data mining
  – Open source (BSD-style license)
• Broad use in complex software, systems, applications
• Long history of funding by DOE, NSF, and DoD

SUPER8

TAU’s Funding and Development Pipeline

Funding pipeline: 2001 – 2011

[Figure: TAU funding timeline. Funders: DOE, NSF. Project labels: PRIMA, MOGO, Vancouver, CCA, ZeptoOS, Productive, Evolution, Knowledge, POINT, Glassbox. Capability labels: source code analysis (PDT); performance data management (PerfDMF); automated source instrumentation; modeling and computational QoS; flexible performance measurements; performance mapping in software layers; kernel-level measurement; runtime scalable monitoring; performance knowledge; performance data mining (PerfExplorer); measurement infrastructure refactor; TAU + Scalasca Score-P; parallel performance visualization; automatic library wrapping; heterogeneous performance; accelerator analysis; open source interoperation; performance engineering; cross-layer integration.]

SUPER9

TAU Technologies

TAU + Scalasca Score-P

ParaProf

TAUdb

PerfExplorer

SUPER10

Impact of TAUdb and PerfExplorer

[Figure: TAUdb and PerfExplorer at the center. Labels: CUDA, OpenCL, CHiLL + AH, Orio, ROSE, Geant4, MPAS-O, CESM, XGC1.]

SUPER11

End-to-End Performance Variability Analysis (CESM)

• Use of GPTL (General Purpose Timing Library)
  – Lightweight profiling to bundle with app
  – NSF + DOE funding
• Couple with platform systems information
  – TAUdb extended to support this data

SUPER12

Geant4 Performance Analysis and Tuning

• Geant4 is extremely important to the design and execution of HEP experiments
  – How to evolve design to best exploit current/future architectures?
  – Geant4 HEP and ASCR partnership
• Not a standard performance analysis/tuning scenario
  – Quantifying performance impact of OO design choices
  – Class-based performance analysis
    • polymorphism (same function name, many implementations)
    • virtual functions (what object types are functions invoked on?)

SUPER13

Using TAU in Geant4

• TAU collects data for Simplified Calorimeter experiment
  – Sampling profiles: low-overhead measurements of full-scale experiments
  – Instrumentation-based: selectively instrumented classes and functions to collect precise measurements for functions (and whole classes) identified through sampling
• Data stored in TAUdb (shared with physics collaborators)
• New analysis enabled by TAUdb and PerfExplorer
  – Class-based profiles: hardware counters and derived metrics
  – Compare impacts of high-level and low-level optimizations
    • changing inheritance structure (design) (high)
    • performance metrics (cache misses, vectorization, …) (low)
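A class-based profile can be thought of as a roll-up of a flat, per-function profile. The sketch below is only an illustration of that idea: the function names, timings, and counter values are invented, and it does not use the TAU, TAUdb, or PerfExplorer interfaces.

```python
from collections import defaultdict

# Hypothetical flat profile rows: (qualified function name, exclusive seconds, L1 misses).
# All names and numbers are made up for illustration.
flat_profile = [
    ("ClassA::Step", 1.5, 300),
    ("ClassA::Track", 0.5, 100),
    ("ClassB::Step", 2.5, 50),
    ("main", 0.25, 10),
]

def class_profile(rows):
    """Sum per-function metrics into per-class totals."""
    totals = defaultdict(lambda: {"seconds": 0.0, "l1_misses": 0})
    for name, seconds, misses in rows:
        # Everything before the last '::' is treated as the class name.
        cls = name.rsplit("::", 1)[0] if "::" in name else "<free functions>"
        totals[cls]["seconds"] += seconds
        totals[cls]["l1_misses"] += misses
    return dict(totals)

by_class = class_profile(flat_profile)
for cls, m in sorted(by_class.items(), key=lambda kv: -kv[1]["seconds"]):
    # Derived metric: cache misses per second of exclusive time.
    rate = m["l1_misses"] / m["seconds"]
    print(f"{cls:18s} {m['seconds']:5.2f} s  {rate:8.1f} misses/s")
```

Grouping by class makes it possible to compare a design change (for example, a different inheritance structure) against a low-level change using the same per-class totals.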

SUPER14

Performance API (PAPI)

• PAPI is middleware that provides a consistent interface and methodology for the performance counter hardware in major microprocessors
• PAPI enables software engineers to see the relation between software performance and hardware events
• PAPI component architecture provides access to a collection of components that expose performance measurement opportunities across the system
  – network, I/O system, accelerators, power/energy

SUPER15

PAPI Pipeline

• DOE support
  – ASCR (2002-05)
  – PERC (2001-06)
  – PERI (2006-11)
• PAPI is widely available on processors and is heavily used in SUPER across areas

PaRSEC (UTK), TAU (UO)

PerfSuite (NCSA), HPCToolkit (Rice)

SCALASCA (FZJ, UTK), Vampir (TUD)

Open|Speedshop (LLNL), SvPablo (RENCI)

SUPER16

Performance Analysis for Communication (mpiP)

• Lightweight and scalable profiling tool for MPI applications
• DOE funding history
  – ASC, PERC, PERI
• SUPER is extending mpiP to collect communication topology information for point-to-point and collective communication
  – SciDAC application characterization studies
  – Benchmarks and applications from DOE-funded Oxbow project
• Developing an automated approach for characterizing the communication topology
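Communication topology for point-to-point traffic is commonly summarized as a rank-by-rank matrix of bytes sent. The following is a minimal sketch of that summary, assuming a hypothetical list of (source, destination, bytes) records; it does not read mpiP output or call any mpiP interface.

```python
def comm_matrix(nranks, messages):
    """Accumulate bytes sent from each source rank to each destination rank."""
    matrix = [[0] * nranks for _ in range(nranks)]
    for src, dst, nbytes in messages:
        matrix[src][dst] += nbytes
    return matrix

def neighbors(matrix, rank):
    """Ranks that receive at least one byte from `rank`."""
    return sorted(dst for dst, nbytes in enumerate(matrix[rank]) if nbytes > 0)

# Hypothetical trace: a bidirectional ring exchange on 4 ranks, 1024 bytes per message.
NRANKS = 4
messages = [(r, (r + 1) % NRANKS, 1024) for r in range(NRANKS)]
messages += [(r, (r - 1) % NRANKS, 1024) for r in range(NRANKS)]

matrix = comm_matrix(NRANKS, messages)
for rank, row in enumerate(matrix):
    print(rank, row, "neighbors:", neighbors(matrix, rank))
```

A pattern such as a ring, a nearest-neighbor stencil, or all-to-all shows up directly in which entries of the matrix are nonzero.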

LAMMPS

LULESH

SUPER17

Analyzing and Modeling Performance and Power

• How can we get energy efficient HPC?
• Understand and model how computation and communication affect the overall performance and energy requirements of HPC applications
• Use performance and power models to design software and hardware-aware “green” techniques to optimize energy footprint
• PEBIL and PSiNtracer (PMaC Labs)
• RCRToolkit (RENCI)

SUPER18

Analysis and Modeling with PEBIL and PSiNstracer

• Capture fundamental operations used by the application
  – Requires low-level, specific details of application
  – Analysis required on large-scale production codes
• PEBIL binary instrumentation
  – Static analysis (memory, FP counts, op parallelism, …)
  – Dynamic (cache hits, execution counts, loop length, …)
• PSiNstracer communication characterization
  – Profiles all communication routines during a run
• Funding heritage
  – DOE (ASCR, PERC, PERI)
  – DoD, NSF

SUPER19

RCRToolkit for Runtime Resource Monitoring

• Resource Centric Reflection (RCR) Toolkit
  – Node-wide performance monitoring and analysis
  – Uncore (“outside the core”)
  – Access through shared blackboard (RCRblackboard)
• Funding pipeline
  – DoD ACS MAESTRO and ATPER
  – DOE (XGC, XPRESS)
  – NSF GENI
• Impact
  – Adaptive scheduling for power and energy
  – Target deterministic strategies for (auto)tuning
  – SciDAC end applications amenable to using

SUPER20

Autotuning Pipeline

• SUPER brings several research efforts together to enable the use and integration of automatic tuning methods and tools
  – Active Harmony (University of Maryland)
  – CHiLL (Utah, USC)
  – Orio (Argonne, UO)
• Powerful capability for performance engineering
  – Parameter exploration automation
  – Couple with code transformation techniques
• Impact can be significant in improving ability to explore multi-dimensional performance space

SUPER21

Active Harmony

• Active Harmony (AH) is an auto-tuning framework that supports online and offline auto-tuning
  – Flexible, plugin-based architecture
• How does it work?
  – Measures program performance
  – Adapts tunable parameters
  – Search heuristics explore options
• Development funding pipeline
  – NSF (1997–2000)
  – DOD (1997–2000, 2010–present)
  – DOE (ASCR, 2001–2012)
  – DOE (SciDAC, 2001–present)

[Figure: Active Harmony tuning loop, steps 1 to 3. Labels: Active Harmony, search strategy, client application, candidate points, evaluated performance, FETCH, REPORT.]
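The measure / adapt / search cycle can be sketched as a fetch-and-report loop between a search strategy and a client. This is a toy illustration, not the Active Harmony API: the `GridSearch` class, the `run_client` cost model, and the `tile` and `unroll` parameters are all hypothetical, and an exhaustive grid stands in for AH's real search heuristics.

```python
import itertools

class GridSearch:
    """Toy search strategy: proposes every point of a small grid once."""

    def __init__(self, space):
        self.names = list(space)
        self._points = itertools.product(*(space[n] for n in self.names))
        self.best_point, self.best_perf = None, float("inf")

    def fetch(self):
        """Return the next candidate point, or None when the search is done."""
        values = next(self._points, None)
        return None if values is None else dict(zip(self.names, values))

    def report(self, point, perf):
        """Record the evaluated performance (lower is better)."""
        if perf < self.best_perf:
            self.best_point, self.best_perf = point, perf

def run_client(point):
    """Stand-in for the client application: a synthetic cost, not a real timing."""
    return (point["tile"] - 32) ** 2 + 10 * abs(point["unroll"] - 4)

search = GridSearch({"tile": [8, 16, 32, 64], "unroll": [1, 2, 4, 8]})
while (point := search.fetch()) is not None:   # fetch candidate point
    search.report(point, run_client(point))    # report evaluated performance

print(search.best_point, search.best_perf)
```

In an online setting the client would be the running application and the reported value a measured time, so the same loop adapts parameters while the program executes.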

SUPER22

Active Harmony Integration

• CHiLL integration
  – Plugin used to access AH search methods
  – Explores performance space from code generation
• TAU integration
  – Plugin used within AH to read from / write to TAUdb
  – TAU used with CHiLL and AH to capture performance
• Application
  – Used with MPAS-O (partitioning optimization)
  – Developed new auto-tuned FFT (1.8x faster than FFTW)

SUPER23

CHiLL Autotuning Pipeline

• CHiLL autotuning system developed in PERI (Utah)
  – Compiler framework for loop transformations
  – Integrated into the PERI autotuning framework
  – Integrated this in SciDAC with other research at Utah
• Funding pipeline
  – NSF NGS (2002)
  – NSF CSR (2005)
  – DOE PERI (2006)
  – DOE ASCR XTUNE (2008)
• Broadening the autotuning research agenda in SUPER
  – Heterogeneous systems
  – Other objectives, in particular energy and resilience

SUPER24

Orio Autotuning Framework

• Express any properties of the computation that can possibly be exploited to optimize
• Orio approach
  – Optimization specifications
    • capture typical optimizations: tiling, unrolling, …
    • specialized implementations: different input sizes
  – Transform code based on knowledge
    • CUDA, OpenCL, OpenMP, …
  – Empirical analysis of variants (different code output)
  – Search for best
• Orio integration with TAU for empirical autotuning
• SUPER impact on PETSc and other libraries
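The generate-variants, measure, and pick-the-best cycle can be illustrated with a small sketch. It does not use Orio's annotation language or tooling; `make_unrolled_sum` is a hypothetical generator that emits Python source for a summation loop unrolled by a given factor, and the timings it produces depend on the machine.

```python
import timeit

def make_unrolled_sum(factor):
    """Generate a variant of a summation loop unrolled by `factor`."""
    terms = " + ".join(f"x[i + {k}]" for k in range(factor))
    src = (
        "def variant(x):\n"
        "    n = len(x)\n"
        "    total = 0.0\n"
        f"    main = n - n % {factor}\n"
        f"    for i in range(0, main, {factor}):\n"
        f"        total += {terms}\n"
        "    for i in range(main, n):\n"   # remainder loop
        "        total += x[i]\n"
        "    return total\n"
    )
    namespace = {}
    exec(src, namespace)  # compile the generated variant
    return namespace["variant"]

# Integer-valued floats keep every partial sum exact, so variants can be compared with ==.
data = [float(i) for i in range(1003)]
reference = sum(data)

timings = {}
for factor in (1, 2, 4, 8):
    variant = make_unrolled_sum(factor)
    assert variant(data) == reference  # reject variants that change the result
    timings[factor] = min(timeit.repeat(lambda: variant(data), number=20, repeat=3))

best = min(timings, key=timings.get)
print("best unroll factor on this machine:", best)
```

The correctness check before timing matters: an empirical search should only rank variants that compute the same answer as the reference.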

SUPER25

Modeling through Source and Empirical Analysis

• Performance bounds give the upper limit in performance that can be expected for a given application on a given system

• Different existing approaches:
  – Fully automatic (ignores machine information)
  – Theoretical peak (based on FP units)
  – Fully dynamic (profiling-based, time, overhead)
• PBound approach (Argonne)
  – Application signatures + architecture bounds

• Roofline modeling (LBL)

SUPER26

PBound

• Developed under PERC, PERI, and SUPER
  – ROSE-based tool that generates performance bounds from source code (C, C++, Fortran)
  – Example: what is the best achievable execution time?
• Based on static (source code) analysis
  – Produces parameterized closed-form expressions expressing the computational and data load/store requirements of application kernels
• Coupled with architectural information
  – Produces upper bounds on the performance of the application
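To make the idea of closed-form counts plus architectural information concrete, here is a hand-written sketch for a DAXPY kernel (`y[i] = a * x[i] + y[i]`). It is not PBound output: the counts are derived by hand, the machine numbers are hypothetical, and the bound assumes compute and memory traffic overlap perfectly.

```python
def daxpy_counts(n):
    """Closed-form operation counts for y[i] = a * x[i] + y[i], 0 <= i < n."""
    return {"flops": 2 * n, "loads": 2 * n, "stores": n}

def time_lower_bound(counts, peak_flops, bandwidth_bytes, word_bytes=8):
    """Best achievable time in seconds: the slower of compute and memory traffic."""
    compute = counts["flops"] / peak_flops
    memory = (counts["loads"] + counts["stores"]) * word_bytes / bandwidth_bytes
    return max(compute, memory)

# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
counts = daxpy_counts(1_000_000)
t_min = time_lower_bound(counts, peak_flops=1e11, bandwidth_bytes=5e10)
perf_max = counts["flops"] / t_min  # upper bound on achievable FLOP/s

print(f"time >= {t_min:.2e} s, performance <= {perf_max / 1e9:.2f} GFLOP/s")
```

Because the counts are parameterized by `n`, the same expressions give a bound for any problem size without running the code.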

SUPER27

Roofline Modeling

• Roofline models characterize architectures and help visualize application performance within the architectural roofline
  – Shows the range of possible application performance
  – Determines how optimizations affect application performance
• Performance space determined by either:
  – Static performance models
    • such as those generated by PBound
  – Empirical models based upon platform experiments
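In its basic form, a roofline bound is the minimum of the compute roof and the memory-bandwidth slope at a kernel's arithmetic intensity. The sketch below evaluates that formula; the peak and bandwidth values are hypothetical and do not describe any particular platform.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min of the compute roof and the bandwidth slope."""
    return min(peak_gflops, intensity * bandwidth_gbs)

PEAK_GFLOPS = 100.0    # hypothetical compute roof
BANDWIDTH_GBS = 50.0   # hypothetical memory bandwidth
ridge = PEAK_GFLOPS / BANDWIDTH_GBS  # intensity (flop/byte) where the two roofs meet

for intensity in (0.25, 1.0, 2.0, 8.0):
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    limit = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:5.2f} flop/byte -> {bound:6.1f} GFLOP/s ({limit})")
```

Kernels to the left of the ridge point benefit from optimizations that reduce memory traffic; kernels to the right benefit from better use of the floating-point units.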

SUPER28

Resilience Pipeline

• Express knowledge of application requirements
  – Semiconductor Research Corporation (SRC)
  – Multiscale Systems (MUSYC) Focused Center Research Program (FCRP)
• New grant from ARO
  – Transition technology into the ROSE compiler (LLNL)
  – Create runtime system based on JPL technology
• Additional NSF and SRC funding with Utah
  – Automatic derivation of predicates
  – Help detect silent errors
• Hardware component based on FPGAs
  – Use FPGAs as co-processors
  – Originally funded by DARPA under the ACS (Adaptive Computing Systems)
• Work continues in SUPER
  – Collaborating with LLNL’s resilience research team
  – Broaden the space of applications and assertions

SUPER29

SUPER Science Pipeline Impact and Outcomes

• Tools continue to improve and are widely distributed and downloaded

• 75 papers produced
• 35 presentations among the institutions
• 24 students matriculated and/or graduated
• 4 postdocs
• 10 internships at DOE national labs
