MulticoreWare
Parallel Path Analyzer (PPA)
Hui Huang, Chunpeng Zhang, Yao Wang, Lihua Zhang
6/13/11 Copyright (C) 2011 MulticoreWare Inc
• What is Parallel Path Analyzer?
• PPA overview
• Developer Library overview
• Story 1: Instrument codes with the developer library
• Story 2: Data collecting and viewing
• Story 3: Data post-processing
• Performance optimization example with PPA
• Other outlined features & future enhancements
• Q&A
Agenda
• A platform-independent standalone tool for analyzing applications on a heterogeneous computing system
• Identify system-wide performance bottlenecks and find the critical paths of applications, whether they are in the CPU, GPU, or I/O
• Target both traditional discrete CPU/GPU-based and APU-accelerated OpenCL & non-OpenCL applications
• Visualize profiling data in intuitive graphs and generate meaningful numerical results
• Seamlessly integrate with other MulticoreWare tools to provide a comprehensive toolset that eases complex heterogeneous computing development
What is Parallel Path Analyzer?
• Provide a user-level library for developers to instrument CPU code, either manually or automatically
• Automatic capture of a full trace of OpenCL API calls and commands
– Common trace format with AMD tools
• A virtual global system clock mechanism is used to create a time-synchronized performance view across CPUs/GPUs
• Support for debugging & statistics events
• User-friendly GUI and comprehensive data views to help developers understand the behavior of their applications
PPA Overview – Key Features
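The slides do not say how the virtual global system clock is implemented. As a rough illustration only, one common way to place device timestamps on the host timeline is to record both clocks at a single sync point and shift device readings by the difference. The names `ClockSync` and `toGlobalNs` are hypothetical and are not part of PPA; the sketch also assumes both clocks tick at the same rate with no drift.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sync point: both clocks are read at (nearly) the same instant.
// Assumption: host and device clocks advance at the same rate (no drift).
struct ClockSync {
    std::int64_t hostNs;    // host clock reading at the sync point
    std::int64_t deviceNs;  // device clock reading at the sync point
};

// Translate a device timestamp into the host ("global") timeline by
// applying the offset observed at the sync point.
inline std::int64_t toGlobalNs(const ClockSync& sync, std::int64_t deviceTimestampNs) {
    return deviceTimestampNs - sync.deviceNs + sync.hostNs;
}
```

With a sync point of host = 1000 ns and device = 5000 ns, a device timestamp of 5200 ns lands at 1200 ns on the host timeline, so CPU and GPU events can be drawn on one axis.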
• Support for profiling multiple applications simultaneously
– Complicated apps that consist of separate but interdependent modules,
which could even be developed by different developers or different vendors
– A DirectShow filter-based app is one such example
• Exclusive AMD OpenCL sub-kernel profiling & debugging capability (to be available soon)
– Identify load-balancing issues between workgroups/wavefronts
– Identify the critical path within a kernel (which segments of code are the bottleneck)
– Debug events to help with run-time debugging
• Fusion support
– Power & bandwidth measurement (to be available soon)
PPA Overview – Key Features
• Developer / User-Level Library
– PPA initialization APIs & profiling APIs
– Runtime DLL
• GUI
– Provide user-friendly interface for all major operations
– Provide comprehensive & intuitive graph view to visualize profiling data
PPA Overview – Main Components
PPA Overview – Main GUI
• Event – a profiling event in PPA is a data structure that has
• A unique event name / ID to identify itself
• An event type to identify its meaning, e.g. start or stop of a CPU event, debug event
• A globalized timestamp to record when it occurred
• Other info depending on its type, e.g.,
– CPU event: start time, stop time, core ID, thread ID & priority
– GPU event: device ID, queued time, submit time, start time, stop time of the CL command
– The essential element to measure: something with performance meaning, such as the start and stop of a subroutine in CPU code that the user is interested in
– The essential element drawn in the PPA viewer
Some PPA Terms
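To make the event fields listed above concrete, here is a minimal sketch of how such records could be laid out. It is not PPA's actual data structure: the type and field names (`EventHeader`, `CpuEvent`, `GpuEvent`, the `Ns` suffixes) are hypothetical, and the event type is simplified to three kinds.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified, hypothetical event kinds (the slides mention start/stop of a
// CPU event and debug events; the GPU kind is added here for illustration).
enum class EventType { Cpu, Gpu, Debug };

// Fields common to every event: name/ID, type, globalized timestamp.
struct EventHeader {
    std::string name;
    EventType type;
    std::uint64_t timestampNs;
};

// CPU event: start time, stop time, core ID, thread ID & priority.
struct CpuEvent {
    EventHeader header;
    std::uint64_t startNs;
    std::uint64_t stopNs;
    int coreId;
    int threadId;
    int priority;
};

// GPU event: device ID plus the four timestamps of a CL command.
struct GpuEvent {
    EventHeader header;
    int deviceId;
    std::uint64_t queuedNs;
    std::uint64_t submitNs;
    std::uint64_t startNs;
    std::uint64_t stopNs;
};

// Time the instrumented CPU region took.
inline std::uint64_t cpuDurationNs(const CpuEvent& e) { return e.stopNs - e.startNs; }

// Time the CL command spent executing on the device.
inline std::uint64_t gpuExecNs(const GpuEvent& e) { return e.stopNs - e.startNs; }

// Time the CL command waited between being queued and starting to run.
inline std::uint64_t gpuWaitNs(const GpuEvent& e) { return e.startNs - e.queuedNs; }
```

Keeping the queued and start times separate is what lets a viewer distinguish time a command spent waiting from time it spent executing.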
• Frame
– By default, all events are drawn in a single view window
– Frames are for Bullet-type apps, which are timestep based and repeat the same pipeline every timestep
– Performance varies between frames, and the target is constant performance across frames
• Session
– A session records the results of a single data collection operation, including the raw profiling data dumped from the app at run time as well as post-processing results
– PPA automatically generates a new session for each data collection operation, and the user can quickly navigate between sessions to check / process historical data
Some PPA Terms
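The framing idea can be sketched as follows, assuming (this is not stated in the slides) that one designated marker event opens each timestep. `Event`, `splitIntoFrames` and `frameDurationNs` are hypothetical names for illustration, not PPA functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Minimal hypothetical event record: a name and a start/stop interval.
struct Event {
    std::string name;
    std::uint64_t startNs;
    std::uint64_t stopNs;
};

// Split a time-ordered event list into frames. A new frame begins at every
// occurrence of markerName; events seen before the first marker are kept
// together in a leading frame.
std::vector<std::vector<Event>> splitIntoFrames(const std::vector<Event>& events,
                                                const std::string& markerName) {
    std::vector<std::vector<Event>> frames;
    for (const Event& e : events) {
        if (frames.empty() || e.name == markerName) {
            frames.emplace_back();
        }
        frames.back().push_back(e);
    }
    return frames;
}

// Wall-clock span of one frame: earliest start to latest stop.
std::uint64_t frameDurationNs(const std::vector<Event>& frame) {
    assert(!frame.empty());
    std::uint64_t first = frame.front().startNs;
    std::uint64_t last = frame.front().stopNs;
    for (const Event& e : frame) {
        first = std::min(first, e.startNs);
        last = std::max(last, e.stopNs);
    }
    return last - first;
}
```

Comparing `frameDurationNs` across frames is one simple way to check the stated target of constant performance from frame to frame.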
• Bullet – An open source project for game physics simulation
– Its core feature is rigid body dynamics, but it also supports effect physics such as particles & soft bodies
– We’ll use the AMD OpenCL version of the Bullet particle demo to introduce PPA & demonstrate how PPA can be used to optimize a heterogeneous app, because
• It uses both the CPU & GPU, with dependencies between them
• The GPU is used for both simulation & rendering
• It is a naturally frame-based application
Bullet as Sample
Library files
• ppa.dll – PPA developer library (user-level runtime)
• ppa.h – Declarations of PPA_INIT and PPA_END, plus other user-level profiling APIs such as PPAStopCpuEventFunc(e)
• ppa.cpp – Implementation of PPA initialization
• ppaEventDefs.h – User-defined events
Developer Library Overview
• Define the events
PPA_DEFINE_EVENT(your-event-name)
• Declare & initialize the lib
PPA_DECLARE_EVENT;
PPA_INIT();
• Instrument the application code using paired routines
// beginning of the code block
PPAStartCpuEventFunc(your-event-name)
…
PPAStopCpuEventFunc(your-event-name)
// end of the code block
• Release the lib
PPA_END();
Developer Library Overview
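The paired start/stop pattern above can be illustrated without the PPA library itself. The sketch below uses a stand-in `Recorder` (hypothetical, not PPA's implementation) that timestamps start and stop calls, plus a small scope guard, `ScopedEvent`, which is one possible way to make sure every start gets its matching stop. Neither class exists in PPA; the real API is the set of macros and functions shown in the slides.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Interval {
    Clock::time_point start;
    Clock::time_point stop;
};

// Stand-in for a profiling runtime: records one interval per start/stop pair.
class Recorder {
public:
    void start(const std::string& name) { open_[name] = Clock::now(); }

    void stop(const std::string& name) {
        auto it = open_.find(name);
        assert(it != open_.end() && "stop without matching start");
        done_[name].push_back(Interval{it->second, Clock::now()});
        open_.erase(it);
    }

    // Number of completed intervals recorded for an event name.
    std::size_t count(const std::string& name) const {
        auto it = done_.find(name);
        return it == done_.end() ? 0 : it->second.size();
    }

    // Number of events that were started but not yet stopped.
    std::size_t openCount() const { return open_.size(); }

private:
    std::map<std::string, Clock::time_point> open_;
    std::map<std::string, std::vector<Interval>> done_;
};

// Scope guard: starts the event on construction and stops it on destruction,
// so early returns cannot leave an event unpaired.
class ScopedEvent {
public:
    ScopedEvent(Recorder& recorder, std::string name)
        : recorder_(recorder), name_(std::move(name)) {
        recorder_.start(name_);
    }
    ~ScopedEvent() { recorder_.stop(name_); }
    ScopedEvent(const ScopedEvent&) = delete;
    ScopedEvent& operator=(const ScopedEvent&) = delete;

private:
    Recorder& recorder_;
    std::string name_;
};
```

A block such as `{ ScopedEvent e(rec, "btDemo_renderme"); /* work */ }` records exactly one interval for that name, which mirrors what a start/stop pair around the same block would capture.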
#include "ppa.h"
PPA_DECLARE_EVENT;
int main(int argc, char** argv)
{
    // Call PPA_INIT() to initialize ppaUtil
    PPA_INIT();
    ParticlesDemo pDemo(argc, argv);
    pDemo.initPhysics();
    pDemo.getDynamicsWorld()->setDebugDrawer(&gDebugDrawer);
    glutmain(argc, argv, 640, 480, "Bullet Physics Demo. http://bulletphysics.com", &pDemo);
    // Call PPA_END() to terminate ppaUtil
    PPA_END();
    return 0;
}
Developer Library Use in Bullet
Event declaration in ppaEventDefs.h:
PPA_DEFINE_EVENT(btDemo_renderme)
:
PPA_DEFINE_EVENT(btPart_runIntegrateMotionKernel)
PPA_DEFINE_EVENT(btPart_runCollideParticlesKernel)
:
Profiling calls in CPU code:
void ParticlesDemo::renderme()
{
    PPAStartCpuEventFunc(btDemo_renderme);
    glColor3f(1.0, 1.0, 1.0);
    :
    PPAStopCpuEventFunc(btDemo_renderme);
}
Developer Library Use in Bullet
PPA Data Collection
PPA Data Viewer
Data Post-processing – Framing
Data Post-processing – Extractor
• The particle pipeline has 5 stages for each timestep (i.e. frame)
• Each stage has one or more kernel launches as well as data copies
Bullet capture before opt
• All clFinish calls are removed except the one in the runCollideParticlesKernel stage
• The host is still blocked by the last clFinish() and by clEnqueueReadBuffer for CLInterop
Bullet capture after initial try
• In fact, no clFinish is needed, since no data read is required until CLInterop
• The host is not blocked except for CLInterop; the main gaps are removed!
Bullet capture after opt
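Why removing per-stage clFinish calls closes the gaps can be seen in a toy timeline model. This is only a sketch: it assumes a single in-order queue, and the stage costs used with it are made-up illustrative numbers, not measurements from the Bullet demo. `Stage` and `frameTime` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One pipeline stage in the toy model: the host spends `enqueue` time units
// submitting the work, and the GPU needs `exec` time units to run it.
struct Stage {
    double enqueue;
    double exec;
};

// Returns the time at which the GPU finishes the last stage of a frame.
// syncAfterEachStage = true models a blocking wait (like clFinish) after
// every stage; false models a single wait at the end of the frame.
double frameTime(const std::vector<Stage>& stages, bool syncAfterEachStage) {
    double host = 0.0;  // time at which the host is next free
    double gpu = 0.0;   // time at which the GPU is next free (in-order queue)
    for (const Stage& s : stages) {
        host += s.enqueue;                   // host submits this stage
        gpu = std::max(gpu, host) + s.exec;  // GPU runs it once submitted and idle
        if (syncAfterEachStage) {
            host = gpu;                      // host blocks until the GPU is done
        }
    }
    return gpu;
}
```

For five identical stages with enqueue = 1 and exec = 4, the per-stage sync version finishes at 25 time units while the version without per-stage syncs finishes at 21: the host's submission cost overlaps with GPU execution instead of leaving the GPU idle between stages.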
Bullet perf before & after opt
• Special CL API calls are removed to address profiling performance concerns
Other outlined features – duplicated events removal
• Pause & resume profiling to profile only the spots of interest
– Reduces captured data size as well
Other outstanding features – pause & resume profiling
• Allows profiling of complicated applications consisting of independent modules
Other outstanding features – multiple app profiling capability
Other outstanding features – System Monitor
• Interoperability with TM and GMAC
– Visualizing task scheduling, load balancing/migration, and data transfer in TM and GMAC in an intuitive way
• Automatic insertion of CPU profiling events
• Automatic critical path analysis
• GPU rendering profiling capability
Future enhancements
• Visit http://www.multicorewareinc.com to apply for the closed beta
• Contact email: [email protected]
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.