SUPER1
Bob Lucas
University of Southern California
Sept. 23, 2011
Science Pipeline
Allen D. Malony
University of Oregon
May 6, 2014
Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research
SUPER2
Fundamental Objectives
• SUPER funnels the rich intellectual products of a long history of research and development in performance areas into an effective performance engineering center of mass for the SciDAC program
• SUPER pulls from prior investments by ASCR and others the technology and expertise that past efforts produced, especially with respect to methodologies, tools, and integration across performance engineering areas
  – measurement, analysis, modeling
  – program analysis, optimization and tuning
  – resilience
• SUPER focuses on integration of expertise for addressing performance engineering problems across the SciDAC landscape, leveraging the robust performance tools available
SUPER3
Pipeline to Tools/Technology Integration and Application
[Figure: funding pipeline diagram. DOE funding and other funding feed research areas (performance, modeling, reliability, autotuning, optimization, resilience, energy, code analysis), which produce tools/technologies (TAU, mpiP, PBound, Active Harmony, CHiLL, Roofline, PAPI, PEBIL, PSiNSTracer, GPTL, RCRToolkit, ROSE, Orio). These converge in a center of mass for performance engineering, with end-to-end integration into SciDAC applications.]
SUPER4
Performance Engineering Tools/Tech Integration
• SUPER focuses on integrating developed tools and technologies to build enhanced capabilities
SUPER5
End-to-end Performance Optimization
• SUPER is establishing processes for applying integrated tools for end-to-end optimization
SUPER6
Tools and Technologies
• Performance
  – TAU Performance System, PAPI, mpiP, GPTL
• Power / Energy
  – PEBIL, PSiNSTracer
• Autotuning
  – Active Harmony, CHiLL, Orio
• Resilience and source analysis
• Modeling
  – PBound, Roofline
• Optimization
SUPER7
TAU Performance System
• Tuning and Analysis Utilities (20+ year project)
• Performance problem solving framework for HPC
• Integrated performance toolkit
  – Multi-level performance instrumentation
  – Flexible and configurable performance measurement
  – Widely-ported performance profiling / tracing system
  – Performance data management and data mining
  – Open source (BSD-style license)
• Broad use in complex software, systems, applications
• Long history of funding by DOE, NSF, and DoD
SUPER8
TAU’s Funding and Development Pipeline
[Figure: TAU funding pipeline, 2001–2011, with DOE and NSF as funders.
Project and theme labels shown: PRIMA, MOGO, Vancouver, CCA, ZeptoOS, Productive, Evolution, Knowledge, POINT, Glassbox.
Capabilities shown: source code analysis (PDT); performance data management (PerfDMF); automated source instrumentation; modeling and computational QoS; flexible performance measurements; performance mapping in software layers; kernel-level measurement; runtime scalable monitoring; performance knowledge; performance data mining (PerfExplorer); measurement infrastructure refactor; TAU + Scalasca Score-P; parallel performance visualization; automatic library wrapping; heterogeneous performance; accelerator analysis; open source interoperation; performance engineering; cross-layer integration.]
SUPER9
TAU Technologies
TAU + Scalasca Score-P
ParaProf
TAUdb
PerfExplorer
SUPER10
Impact of TAUdb and PerfExplorer
[Figure labels: TAUdb and PerfExplorer; programming models and tools CUDA, OpenCL, CHiLL + AH, Orio, ROSE; applications Geant4, MPAS-O, CESM, XGC1]
SUPER11
End-to-End Performance Variability Analysis (CESM)
• Use of GPTL (General Purpose Timing Library)
  – Lightweight profiling to bundle with app
  – NSF + DOE funding
• Couple with platform systems information
  – TAUdb extended to support this data
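As a minimal sketch of the kind of run-to-run variability analysis this slide describes, assume per-run timer totals have already been exported from the profiles. The data layout, the timer names `atm_run` and `ocn_run`, and the numbers are made up for illustration; this does not use the GPTL or TAUdb APIs.

```python
from statistics import mean, pstdev


def variability(runs):
    """Return {timer: coefficient of variation} across runs.

    `runs` is a list of dicts mapping timer name -> seconds
    (a hypothetical layout, not a GPTL or TAUdb format).
    """
    timers = set().union(*runs)
    report = {}
    for name in timers:
        samples = [run[name] for run in runs if name in run]
        avg = mean(samples)
        # Coefficient of variation: population std. dev. divided by the mean.
        report[name] = pstdev(samples) / avg if avg > 0 else 0.0
    return report


# Illustrative numbers only, not measurements.
runs = [
    {"atm_run": 10.0, "ocn_run": 20.0},
    {"atm_run": 10.0, "ocn_run": 30.0},
]
cv = variability(runs)
most_variable = max(cv, key=cv.get)
```

Ranking timers by coefficient of variation is one simple way to decide which regions to correlate with platform information first.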
SUPER12
Geant4 Performance Analysis and Tuning
• Geant4 is extremely important to the design and execution of HEP experiments
  – How to evolve design to best exploit current/future architectures?
  – Geant4 HEP and ASCR partnership
• Not a standard performance analysis/tuning scenario
  – Quantifying performance impact of OO design choices
  – Class-based performance analysis
    • polymorphism (same function name, many implementations)
    • virtual functions (what object types are functions invoked on?)
SUPER13
Using TAU in Geant4
• TAU collects data for Simplified Calorimeter experiment
  – Sampling profiles: low-overhead measurements of full-scale experiments
  – Instrumentation-based: selectively instrumented classes and functions to collect precise measurements for functions (and whole classes) identified through sampling
• Data stored in TAUdb (shared with physics collaborators)
• New analysis enabled by TAUdb and PerfExplorer
  – Class-based profiles: hardware counters and derived metrics
  – Compare impacts of high-level and low-level optimizations
    • changing inheritance structure (design) (high)
    • performance metrics (cache misses, vectorization, …) (low)
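A class-based profile can be sketched as a roll-up of a flat function profile by class name. The example below is an illustration under assumptions: the tuple layout, the class names `Box` and `Tube`, and all numbers are hypothetical, and it does not use the TAUdb or PerfExplorer interfaces.

```python
from collections import defaultdict


def class_profile(flat_profile):
    """Roll a flat function profile up to class level.

    flat_profile: iterable of (qualified_name, seconds, cache_misses),
    where qualified_name looks like "Class::method" (assumed layout).
    """
    totals = defaultdict(lambda: {"seconds": 0.0, "cache_misses": 0})
    for name, seconds, misses in flat_profile:
        # Everything before the last "::" is treated as the class name.
        cls = name.rsplit("::", 1)[0] if "::" in name else "<free functions>"
        totals[cls]["seconds"] += seconds
        totals[cls]["cache_misses"] += misses
    for entry in totals.values():
        # Derived metric: cache misses per second of measured time.
        entry["misses_per_second"] = (
            entry["cache_misses"] / entry["seconds"] if entry["seconds"] else 0.0
        )
    return dict(totals)


# Hypothetical profile: the same method name implemented by two classes.
profile = [
    ("Box::DistanceToIn", 2.0, 400),
    ("Box::Inside", 1.0, 200),
    ("Tube::DistanceToIn", 4.0, 2000),
    ("main", 0.5, 10),
]
by_class = class_profile(profile)
```

Grouping by class makes it possible to compare two implementations of the same virtual function, which a flat list sorted by function name would hide.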
SUPER14
Performance API (PAPI)
• PAPI is middleware that provides a consistent interface and methodology for the performance counter hardware in major microprocessors
• PAPI enables software engineers to see the relation between software performance and hardware events
• PAPI component architecture provides access to a collection of components that expose performance measurement opportunities across the system
  – network, I/O system, accelerators, power/energy
SUPER15
PAPI Pipeline
• DOE support
  – ASCR (2002-05)
  – PERC (2001-06)
  – PERI (2006-11)
• PAPI is widely available on processors and is heavily used in SUPER across areas
PaRSEC (UTK)
TAU (UO)
PerfSuite (NCSA)
HPCToolkit (Rice)
SCALASCA (FZJ, UTK)
Vampir (TUD)
Open|SpeedShop (LLNL)
SvPablo (RENCI)
SUPER16
Performance Analysis for Communication (mpiP)
• Lightweight and scalable profiling tool for MPI applications
• DOE funding history
  – ASC, PERC, PERI
• SUPER is extending mpiP to collect communication topology information for point-to-point and collective communication
  – SciDAC application characterization studies
  – Benchmarks and applications from DOE-funded Oxbow project
• Developing an automated approach for characterizing the communication topology
[Figure panels: LAMMPS, LULESH]
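To make "communication topology" concrete, here is a small sketch that builds a rank-by-rank byte matrix from point-to-point message records and counts each rank's partners. The `(src, dst, nbytes)` record format and the ring pattern are assumptions for illustration; this is not mpiP's output format or API.

```python
def comm_matrix(num_ranks, messages):
    """Accumulate bytes sent from each source rank to each destination rank."""
    matrix = [[0] * num_ranks for _ in range(num_ranks)]
    for src, dst, nbytes in messages:
        matrix[src][dst] += nbytes
    return matrix


def partners(matrix):
    """Number of distinct destinations each rank sends to."""
    return [sum(1 for nbytes in row if nbytes > 0) for row in matrix]


# Hypothetical trace: a 1-D ring where every rank sends to both neighbors.
num_ranks = 4
messages = [(r, (r + 1) % num_ranks, 1024) for r in range(num_ranks)]
messages += [(r, (r - 1) % num_ranks, 1024) for r in range(num_ranks)]

matrix = comm_matrix(num_ranks, messages)
degree = partners(matrix)
```

A uniform partner count of two, as in this toy ring, is the signature of a nearest-neighbor pattern; an automated characterization could start from summaries like this.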
SUPER17
Analyzing and Modeling Performance and Power
• How can we get energy efficient HPC?
• Understand and model how computation and communication affect the overall performance and energy requirements of HPC applications
• Use performance and power models to design software and hardware-aware “green” techniques to optimize energy footprint
• PEBIL and PSiNSTracer (PMaC Labs)
• RCRToolkit (RENCI)
SUPER18
Analysis and Modeling with PEBIL and PSiNSTracer
• Capture fundamental operations used by the application
  – Requires low-level, specific details of application
  – Analysis required on large-scale production codes
• PEBIL binary instrumentation
  – Static analysis (memory, FP counts, op parallelism, …)
  – Dynamic analysis (cache hits, execution counts, loop length, …)
• PSiNSTracer communication characterization
  – Profiles all communication routines during a run
• Funding heritage
  – DOE (ASCR, PERC, PERI)
  – DoD, NSF
SUPER19
RCRToolkit for Runtime Resource Monitoring
• Resource Centric Reflection (RCR) Toolkit
  – Node-wide performance monitoring and analysis
  – Uncore (“outside the core”)
  – Access through shared blackboard (RCRblackboard)
• Funding pipeline
  – DoD ACS MAESTRO and ATPER
  – DOE (XGC, XPRESS)
  – NSF GENI
• Impact
  – Adaptive scheduling for power and energy
  – Target deterministic strategies for (auto)tuning
  – SciDAC end applications amenable to using
SUPER20
Autotuning Pipeline
• SUPER brings several research efforts together to enable the use and integration of automatic tuning methods and tools
  – Active Harmony (University of Maryland)
  – CHiLL (Utah, USC)
  – Orio (Argonne, UO)
• Powerful capability for performance engineering
  – Parameter exploration automation
  – Couple with code transformation techniques
• Impact can be significant in improving ability to explore multi-dimensional performance space
SUPER21
Active Harmony
• Active Harmony (AH) is an auto-tuning framework that supports online and offline auto-tuning
  – Flexible, plugin-based architecture
• How does it work?
  – Measures program performance
  – Adapts tunable parameters
  – Search heuristics explore options
• Development funding pipeline
  – NSF (1997–2000)
  – DOD (1997–2000, 2010–present)
  – DOE (ASCR, 2001–2012)
  – DOE (SciDAC, 2001–present)
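[Figure: Active Harmony tuning loop. (1) The client application FETCHes candidate points from Active Harmony; (2) the application runs and its performance is evaluated; (3) the client REPORTs the evaluated performance back to the search strategy.]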
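The measure / adapt / search cycle above can be sketched as a fetch-and-report loop. This is a toy under stated assumptions, not the Active Harmony API: the `HillClimbSearch` class, its `fetch` and `report` methods, the parameter names `tile` and `unroll`, and the synthetic `measure` function are all hypothetical, and a simple first-improvement hill climb stands in for the framework's search heuristics.

```python
class HillClimbSearch:
    """Toy search strategy: propose neighbors of the best point seen so far."""

    def __init__(self, bounds, start):
        self.bounds = bounds              # {name: (low, high)}, inclusive integers
        self.best_point = None
        self.best_perf = float("inf")
        self.pending = [dict(start)]      # candidate points not yet handed out
        self.seen = set()                 # points already handed to the client

    def _neighbors(self, point):
        # Step each parameter by +/-1 while staying inside its bounds.
        for name, (low, high) in self.bounds.items():
            for step in (-1, 1):
                value = point[name] + step
                if low <= value <= high:
                    yield {**point, name: value}

    def fetch(self):
        """Return the next unseen candidate point, or None when converged."""
        while self.pending:
            point = self.pending.pop()
            key = tuple(sorted(point.items()))
            if key not in self.seen:
                self.seen.add(key)
                return point
        return None

    def report(self, point, perf):
        """Record measured performance (lower is better) and adapt the search."""
        if perf < self.best_perf:
            self.best_point, self.best_perf = point, perf
            self.pending = list(self._neighbors(point))


def measure(point):
    # Stand-in for running the client application with these parameters.
    return (point["tile"] - 8) ** 2 + (point["unroll"] - 4) ** 2


search = HillClimbSearch({"tile": (1, 16), "unroll": (1, 8)},
                         start={"tile": 1, "unroll": 1})
point = search.fetch()
while point is not None:
    search.report(point, measure(point))
    point = search.fetch()
```

Keeping the search strategy behind two calls means the client never needs to know which heuristic is in use, which is the property a plugin-based architecture relies on.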
SUPER22
Active Harmony Integration
• CHiLL integration
  – Plugin used to access AH search methods
  – Explores performance space from code generation
• TAU integration
  – Plugin used within AH to read from / write to TAUdb
  – TAU used with CHiLL and AH to capture performance
• Application
  – Used with MPAS-O (partitioning optimization)
  – Developed new auto-tuned FFT (1.8x faster than FFTW)
SUPER23
CHiLL Autotuning Pipeline
• CHiLL autotuning system developed in PERI (Utah)
  – Compiler framework for loop transformations
  – Integrated into the PERI autotuning framework
  – Integrated this in SciDAC with other research at Utah
• Funding pipeline
  – NSF NGS (2002)
  – NSF CSR (2005)
  – DOE PERI (2006)
  – DOE ASCR XTUNE (2008)
• Broadening the autotuning research agenda in SUPER
  – Heterogeneous systems
  – Other objectives, in particular energy and resilience
SUPER24
Orio Autotuning Framework
• Express any properties of the computation that can possibly be exploited to optimize
• Orio approach
  – Optimization specifications
    • capture typical optimizations: tiling, unrolling, …
    • specialized implementations: different input sizes
  – Transform code based on knowledge
    • CUDA, OpenCL, OpenMP, …
  – Empirical analysis of variants (different code output)
  – Search for best
• Orio integration with TAU for empirical autotuning
• SUPER impact on PETSc and other libraries
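The "generate variants, measure them, keep the best" idea can be illustrated with a toy that generates unrolled versions of a dot product, checks each against a reference, and times them. It is a sketch under assumptions: it does not use Orio's specification syntax or its code generators, it targets Python source purely for illustration, and `make_variant` and `tune` are hypothetical helper names.

```python
import timeit


def make_variant(unroll):
    """Generate and compile source for a dot product unrolled `unroll` times."""
    body = " + ".join(f"x[i + {k}] * y[i + {k}]" for k in range(unroll))
    src = (
        "def dot(x, y):\n"
        "    n = len(x)\n"
        "    total = 0.0\n"
        f"    main = n - n % {unroll}\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"        total += {body}\n"
        "    for i in range(main, n):\n"      # remainder loop
        "        total += x[i] * y[i]\n"
        "    return total\n"
    )
    namespace = {}
    exec(src, namespace)
    return namespace["dot"]


def tune(unroll_factors, x, y, repeats=3):
    """Time every variant and return (best unroll factor, all timings)."""
    reference = sum(a * b for a, b in zip(x, y))
    timings = {}
    for unroll in unroll_factors:
        variant = make_variant(unroll)
        # Reject any variant that does not reproduce the reference result.
        if abs(variant(x, y) - reference) > 1e-9 * max(1.0, abs(reference)):
            raise ValueError(f"variant with unroll={unroll} is incorrect")
        timings[unroll] = min(
            timeit.repeat(lambda: variant(x, y), number=10, repeat=repeats)
        )
    best = min(timings, key=timings.get)
    return best, timings


x = [float(i) for i in range(1003)]   # length chosen to exercise the remainder loop
y = [1.0] * 1003
best, timings = tune([1, 2, 4, 8], x, y)
```

Which factor wins depends on the machine and interpreter, which is why the choice is made empirically and not fixed in advance.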
SUPER25
Modeling through Source and Empirical Analysis
• Performance bounds give the upper limit in performance that can be expected for a given application on a given system
• Different existing approaches:
  – Fully automatic (ignores machine information)
  – Theoretical peak (based on FP units)
  – Fully dynamic (profiling-based, time, overhead)
• PBound approach (Argonne)
  – Application signatures + architecture bounds
• Roofline modeling (LBL)
SUPER26
PBound
• Developed under PERC, PERI, and SUPER
  – ROSE-based tool that generates performance bounds from source code (C, C++, Fortran)
  – Example: what is the best achievable execution time?
• Based on static (source code) analysis
  – Produces parameterized closed-form expressions expressing the computational and data load/store requirements of application kernels
• Coupled with architectural information
  – Produces upper bounds on the performance of the application
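As a worked illustration of combining parameterized counts with architectural information, the sketch below hand-writes the closed-form counts for an axpy loop and turns them into a lower bound on execution time. Everything here is an assumption for illustration: PBound derives such expressions from source code, whereas these are written by hand; the machine numbers are invented; and taking the maximum of compute time and memory time (perfect overlap) is a simplification, not necessarily the model PBound uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelCounts:
    flops: int
    loads: int
    stores: int


@dataclass(frozen=True)
class Machine:
    peak_flops: float   # floating-point operations per second
    bandwidth: float    # bytes per second to and from memory


def axpy_counts(n):
    """Closed-form counts for: for i in range(n): y[i] = a * x[i] + y[i]."""
    # Per iteration: one multiply and one add, loads of x[i] and y[i], one store.
    return KernelCounts(flops=2 * n, loads=2 * n, stores=n)


def time_lower_bound(counts, machine, word_bytes=8):
    """Best achievable time, assuming compute and memory traffic fully overlap."""
    compute = counts.flops / machine.peak_flops
    memory = (counts.loads + counts.stores) * word_bytes / machine.bandwidth
    return max(compute, memory)


# Hypothetical machine: 100 Gflop/s peak, 50 GB/s memory bandwidth.
machine = Machine(peak_flops=1e11, bandwidth=5e10)
bound = time_lower_bound(axpy_counts(1_000_000), machine)
```

With these invented numbers the memory term (0.48 ms) dominates the compute term (0.02 ms), so the bound says the loop is limited by data movement.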
SUPER27
Roofline Modeling
• Roofline models characterize architectures and help visualize application performance within the architectural roofline
  – Shows the range of possible application performance
  – Determines how optimizations affect application performance
• Performance space determined by either:
  – Static performance models
    • such as those generated by PBound
  – Empirical models based upon platform experiments
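In its basic form, the roofline caps attainable performance at the smaller of peak compute rate and arithmetic intensity times memory bandwidth. The sketch below evaluates that ceiling; the function names and the machine numbers (100 GFLOP/s, 50 GB/s) are illustrative assumptions, not measurements of any platform.

```python
def roofline(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s for a kernel with the given flops-per-byte ratio."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being bandwidth-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine parameters.
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = 50.0

# A streaming kernel doing 2 flops per 24 bytes moved sits under the sloped roof.
low_intensity = roofline(2 / 24, PEAK_GFLOPS, BANDWIDTH_GBS)
# A kernel doing 4 flops per byte reaches the flat, compute-bound roof.
high_intensity = roofline(4.0, PEAK_GFLOPS, BANDWIDTH_GBS)
ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
```

An optimization that raises arithmetic intensity moves a kernel to the right along the sloped roof, which is how the model shows the effect of an optimization on attainable performance.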
SUPER28
Resilience Pipeline
• Express knowledge of application requirements
  – Semiconductor Research Corporation (SRC)
  – Multiscale Systems (MUSYC) Focused Center Research Program (FCRP)
• New grant from ARO
  – Transition technology into the ROSE compiler (LLNL)
  – Create runtime system based on JPL technology
• Additional NSF and SRC funding with Utah
  – Automatic derivation of predicates
  – Help detect silent errors
• Hardware component based on FPGAs
  – Use FPGAs as co-processors
  – Originally funded by DARPA under the ACS (Adaptive Computing Systems) program
• Work continues in SUPER
  – Collaborating with LLNL’s resilience research team
  – Broaden the space of applications and assertions
SUPER29
SUPER Science Pipeline Impact and Outcomes
• Tools continue to improve and are widely distributed and downloaded
• 75 papers produced
• 35 presentations among the institutions
• 24 students matriculated and/or graduated
• 4 postdocs
• 10 internships at DOE national labs