
QQ: Nanoscale Timing and Profiling

James Frye † *, James G. King † *, Christine J. Wilson * ◊, Frederick C. Harris, Jr. † *

†Department of Computer Science and Engineering, *Brain Computation Lab, ◊Biomedical Engineering

University of Nevada, Reno NV 89557

2005 IPDPS Conference
19th IEEE International Parallel & Distributed Processing Symposium

4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS’05)

What is QQ

• QQ is a simple and efficient tool for measuring timing and memory use

• Developed for the examination of a massively parallel program

• Easily extensible to inspect other programs

QQ Development

• QQ was developed to optimize a parallel program used to simulate cortical neurons – NeoCortical Simulator (NCS)

• Our goal for the summer of 2002 was to simulate 10^6 neurons with 10^9 synapses within a realistic run time

• Before optimization, NCS would run about 1.5 million synapses at a rate of 1 day per simulated second of synaptic activity

• Clearly optimization of NCS was needed

NeoCortical Simulator (NCS)

• Originated in the Brain Computation Lab led by Dr. Phil Goodman

• Incorporates membrane dynamics

• Utilizes simulated ion channels to modulate the membrane voltage changes (when applied)

• Compartment based simulator

• Allows for channel dynamics to drive the membrane voltage

NCS Biology

• Neuron – a brain cell and the basic unit or compartment

• Synapse – the region of communication between compartments

• Channel – openings in the cellular membrane that allow the passage of various ions to induce a voltage gradient across the membrane

• Action Potential – an electrical signal that translates to a chemical signal to the post-synaptic cell

Neurons

NCS Biology

• The membrane voltage determines the cell’s firing rate

• Once the threshold voltage is reached, the cell sends an action potential to its connected synapses

[Figure: Action Potential. Membrane voltage (mV) against Time (ms), with voltage axis marks at -45, 0, and 30.]

2-Cell Model

[Figure: voltage traces for the Pre-Synaptic Cell and the Post-Synaptic Cell; scale bar 0.2 mV; Time (ms) axis marked 0, 100, 200, 300, 400, 500.]

No Channels: Sustained firing at maximum rate during a continuous stimulus

Ka Channel: Slows the initial response during a sustained stimulus

Km Channel: Prevents continuous bursting during a continuous stimulus

Kahp Channel: Dampens the effect while still allowing for some action potentials during a sustained stimulus

QQ Design

• QQ is designed so that all of its routines can be selectively compiled into a program

• In the QQ.h header file, each routine is defined with a preprocessor directive, so that if profiling is not enabled, it reduces to an empty statement.

#ifdef QQ_ENABLE
void QQInit (int);
#else
#define QQInit(dummy)
#endif

QQ Design

• Memory profiling routines also use the C preprocessor to intercept library calls

#ifdef QQ_ENABLE
#define malloc(arg) MemMalloc (MEM_KEY, arg)
#endif

• The MemMalloc function records allocation information, calls the malloc function to do the actual allocation, and returns the result to the caller
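The interception described above can be sketched as follows. This is a minimal illustration, not QQ's actual code: the `MemStats` structure, the `mem_stats` table, the `KEY_SYNAPSE` and `KEY_COUNT` names, and the `make_channel_table` helper are all hypothetical. Only `MemMalloc`, `MEM_KEY`, `KEY_CHANNEL`, and `QQ_ENABLE` appear in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical category keys; the slides only show KEY_CHANNEL. */
enum { KEY_CHANNEL, KEY_SYNAPSE, KEY_COUNT };

/* Hypothetical per-key bookkeeping. */
typedef struct {
    size_t calls;  /* number of successful allocation calls */
    size_t bytes;  /* total bytes handed out */
} MemStats;

static MemStats mem_stats[KEY_COUNT];

/* Let the real malloc do the allocation, record it under `key`, and
   return the result. Defined before the macro below, so `malloc` here
   still names the library routine. */
static void *MemMalloc(int key, size_t size)
{
    assert(key >= 0 && key < KEY_COUNT);
    void *p = malloc(size);
    if (p != NULL) {
        mem_stats[key].calls += 1;
        mem_stats[key].bytes += size;
    }
    return p;
}

#define QQ_ENABLE            /* normally set on the compiler command line */
#define MEM_KEY KEY_CHANNEL  /* category for the code that follows */

#ifdef QQ_ENABLE
#define malloc(arg) MemMalloc(MEM_KEY, arg)
#endif

/* Ordinary-looking code: the preprocessor reroutes this malloc call. */
static double *make_channel_table(size_t n)
{
    return malloc(n * sizeof(double));
}
```

Because the macro is applied by the preprocessor, the calling code needs no changes; removing `QQ_ENABLE` restores the plain library call.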

QQ Timing

• Extremely accurate measurement of execution speed.

• In theory fine-grained resolution to a single clock cycle.

• In practice, measurements are accurate to tens of cycles
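The slides do not show how QQ reads the clock. As an assumption, one common way to get cycle-level timestamps on x86 processors is the time-stamp counter, available through the `__rdtsc()` intrinsic in GCC and Clang. The sketch below uses it where available and falls back to the much coarser standard `clock()` elsewhere. The names `read_ticks` and `min_overhead` are hypothetical. Timing two back-to-back reads gives a feel for why the practical accuracy is tens of cycles rather than one.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>   /* GCC/Clang: provides __rdtsc() */
#endif

/* Return a tick count: the time-stamp counter on x86 with GCC or Clang,
   otherwise clock() ticks, which are far coarser. */
static uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Smallest difference seen between two back-to-back reads: an estimate
   of the cost of the measurement itself. Returns UINT64_MAX if no trial
   produced a usable (non-decreasing) pair of readings. */
static uint64_t min_overhead(int trials)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t0 = read_ticks();
        uint64_t t1 = read_ticks();
        if (t1 >= t0 && t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Subtracting this overhead from a measured interval is one way to tighten short measurements; whether QQ does so is not stated in the slides.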

Timing Measurements

• Measuring the impact of a line change in the calculation for the Km channel

From:
I = unitaryG * strength * pow (m, mPower) * (ReversePot - CmpV);

To:
I = unitaryG * strength * (ReversePot - CmpV);

• For the Km-type channel, mPower is always 1, so we were able to change the equation to streamline the execution

• Wrapping the line in calls to QQ, we measure the effect of this single change

QQStateOn (QQ_Km);
I = unitaryG * strength * (ReversePot - CmpV);
QQStateOff (QQ_Km);

Timing Measurements

• Each code version gives similar cycle counts on both processors, though the counts are more consistent and somewhat lower on the P4 than on the PIII.

• Times for similar counts scale inversely with processor clock speed, as expected.

• The function call pays a heavy penalty on its first invocation. In this code it is called only by the Km channel code, so that time represents the first load of the code into cache.
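The relation between cycle counts and elapsed time can be written as a one-line conversion. One MHz is one cycle per microsecond, so dividing a cycle count by the clock rate in MHz gives microseconds. The function name `cycles_to_usec` is hypothetical.

```c
#include <assert.h>

/* Convert a cycle count to microseconds for a clock rate given in MHz.
   One MHz is one cycle per microsecond, so the conversion is a division. */
static double cycles_to_usec(unsigned long long cycles, double mhz)
{
    assert(mhz > 0.0);
    return (double)cycles / mhz;
}
```

For example, 400 cycles take 0.5 uSec at 800 MHz, and the same count takes 2.75 times less time at 2200 MHz.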

Timing Measurements

[Figure: PIII – 800 MHz. Four charts with x-axis 1 to 10, each comparing old and new code: Old uSec vs. New uSec (axis 0 to 40), Old uSec vs. New uSec (axis 0 to 0.6), Old Clockcycles vs. New Clockcycles (axis 0 to 40000), Old Clockcycles vs. New Clockcycles (axis 0 to 1400).]

Timing Measurements

[Figure: P4 – 2200 MHz. Four charts with x-axis 1 to 10, each comparing old and new code: Old uSec vs. New uSec (axis 0 to 40), Old uSec vs. New uSec (axis 0 to 0.7), Old Clockcycles vs. New Clockcycles (axis 0 to 40000), Old Clockcycles vs. New Clockcycles (axis 0 to 600).]

Expanding Timing Information

• QQ allows the user to record an additional item of information with the normal timing.

– QQCount records an integer with the key

– QQCount( eventKey, integer_of_interest );

– QQValue records a double precision floating point value with the key

– QQValue( eventKey, double_of_interest );

– QQState records a state of on or off with the key

– QQStateOn( eventKey ); QQStateOff( eventKey );

• These will be described during discussion of the output format

QQ Memory

• Records memory allocation dedicated to the code-block, rather than the total allocation due to code and library calls, to single-byte accuracy

QQ Memory Example

• NCS implementation of ion channels

• Suppose we want to know the total memory used by all channels. Each channel function would require the channel key:

#define MEM_KEY KEY_CHANNEL

• Then at any point in the program execution, just call the MemPrint function to display memory use

Memory Usage Output

Memory Allocation: Total Allocated = 988 KBytes

              Object  Number   Number   Object  Alloc  Total  Max
Item          Size    Created  Deleted  KB      KB     KB     KB
Brain         120     1        0        1       0      1      1
CellManager   44      1        0        1       1      1      1
Cell          16      100      0        2       0      2      2
Channel       252     300      0        74      0      74     74
Compartment   324     100      0        32      2      33     33
MessageMgr    16      1        0        1       205    205    205
MessageBus    0       0        0        0       1      1      1
Report        80      1        0        1       1      1      1
Stimulus      252     1        0        1       1      1      1
Synapse       44      10000    0        430     118    547    547
-----------------------------------------------------------------
1             2       3        4        5       6      7      8

Key
1 - Internal name given to recording category
2 - The size of the object being allocated; it is valid only if all allocations are the same size, as with "new Object"
3 - Number of allocation calls made: new, malloc, calloc, etc.
4 - Number of free or delete calls made
5 - KBytes allocated via object creation (new)
6 - KBytes allocated via *alloc calls
7 - Total memory currently allocated
8 - Max memory ever allocated (high-water mark)

QQ Applications

• Brain Communication Server (BCS)

• NCS

Further experimentation with the simulator required another application be developed to coordinate communication between NCS and numerous potential clients:

• virtual creatures

• physical robots

• visualization tools

BCS

[Diagram labels: Brain Communication Server, NCS]

Optimizing BCS

Different applications make non-sequential requests. No single function was called in a loop iterating several times, so time needed to be measured over the course of execution. The analysis was then performed on QQ’s final output.

Parsing QQ’s output

• QQ uses a straightforward layout for the final output file

• The data can be easily extracted and displayed in a text report as shown on the previous slide or sent to a graphical display

• The following slides describe the output format and how to manage the information

QQ file format

Header
Number of Keys (int), Key Name string length (int)

Key Table
For each Key – Key ID (int), Key type (int), Key name (char *)

Node Information
Number of nodes (int)

Node Table
For each Node – Byte offset to data (size_t), Number of entries (int), Starting Base Time (unsigned long long), MHz (double)

Data
For each Node, For each entry – item (QQItem)
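The header and node table fields listed above could be read along these lines. This is a sketch under stated assumptions: the file is taken to store each field back to back in the machine's native byte order and type sizes, which the slides do not confirm, and the names `QQHeader`, `QQNodeEntry`, `read_field`, `read_header`, and `read_node_entry` are hypothetical. Fields are read one at a time so that structure padding does not matter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Header fields, in the order the slide lists them. */
typedef struct {
    int num_keys;      /* Number of Keys */
    int key_name_len;  /* Key Name string length */
} QQHeader;

/* One node table entry, in the order the slide lists the fields. */
typedef struct {
    size_t data_offset;            /* Byte offset to data */
    int num_entries;               /* Number of entries */
    unsigned long long base_time;  /* Starting Base Time */
    double mhz;                    /* MHz */
} QQNodeEntry;

/* Read one field of `size` bytes; returns 1 on success, 0 on failure. */
static int read_field(FILE *fp, void *dst, size_t size)
{
    assert(fp != NULL && dst != NULL);
    return fread(dst, size, 1, fp) == 1;
}

/* Read the two header integers from the current file position. */
static int read_header(FILE *fp, QQHeader *h)
{
    return read_field(fp, &h->num_keys, sizeof h->num_keys)
        && read_field(fp, &h->key_name_len, sizeof h->key_name_len);
}

/* Read one node table entry from the current file position. */
static int read_node_entry(FILE *fp, QQNodeEntry *n)
{
    return read_field(fp, &n->data_offset, sizeof n->data_offset)
        && read_field(fp, &n->num_entries, sizeof n->num_entries)
        && read_field(fp, &n->base_time, sizeof n->base_time)
        && read_field(fp, &n->mhz, sizeof n->mhz);
}
```

A full reader would also walk the key table and the node count between these two parts, then seek to each node's byte offset to read its entries.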

QQ Format – Data Close Up

[Diagram: the Node 0, Node 1, and Node 2 byte offsets stored in the previous sections each point to the start of that node's entries in the Data section]

Data
Node 0 – For each entry: Key (int), [Optional Info], Event Time (unsigned long long)

Node 1 – For each entry: Key (int), [Optional Info], Event Time (unsigned long long)

Node 2 – For each entry …

Where Optional Info is the size of a double, but contains a State (int), a Count (int), or a Value (double)
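An Optional Info slot that is the size of a double but holds an int or a double maps naturally onto a C union. The sketch below is an assumption about how such an entry might be declared, not QQ's actual `QQItem` definition: the field names, the `QQInfo` union, the `QQKeyType` enumeration, and the `info_as_double` helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Optional Info: one slot, interpreted according to the key's type. */
typedef union {
    int state;     /* State keys: 1 for on, 0 for off */
    int count;     /* Count keys: integer supplied by the caller */
    double value;  /* Value keys: double supplied by the caller */
} QQInfo;

/* Hypothetical layout of one entry: key, optional info, event time. */
typedef struct {
    int key;
    QQInfo info;
    unsigned long long time;
} QQItem;

/* Hypothetical key types, matching the three kinds of Optional Info. */
typedef enum { QQ_TYPE_STATE, QQ_TYPE_COUNT, QQ_TYPE_VALUE } QQKeyType;

/* Return the optional info as a double, whatever the key type is. */
static double info_as_double(const QQItem *item, QQKeyType type)
{
    assert(item != NULL);
    switch (type) {
    case QQ_TYPE_STATE: return (double)item->info.state;
    case QQ_TYPE_COUNT: return (double)item->info.count;
    case QQ_TYPE_VALUE: return item->info.value;
    }
    return 0.0;
}
```

The reader must consult the key table to know which union member is valid, since the entry itself does not carry its type.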

Gathering the Results

• After reading a node’s data section, entries with the same key can be gathered.

• Using the key table, the user knows what is contained in the second block of a timing entry

2 1 109342759

2 0 109342768

Example: Key 2 has type “State”. The second block contains integer 1 for “on” or integer 0 for “off”. By subtracting the event times, the length of time spent in the “on” state is determined.
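The subtraction in this example can be sketched as a small routine that pairs each "on" event with the next "off" event for the same key and sums the differences. The `StateEntry` structure and the `total_on_time` function are hypothetical names for illustration; the result is in the same units as the event times.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory form of a State entry after parsing. */
typedef struct {
    int key;
    int state;                /* 1 = on, 0 = off */
    unsigned long long time;  /* event time */
} StateEntry;

/* Sum the time between each "on" and the next "off" for `key`.
   Entries for other keys are skipped; an unmatched trailing "on"
   contributes nothing. */
static unsigned long long total_on_time(const StateEntry *e, size_t n, int key)
{
    assert(e != NULL || n == 0);
    unsigned long long total = 0, on_since = 0;
    int is_on = 0;
    for (size_t i = 0; i < n; i++) {
        if (e[i].key != key)
            continue;
        if (e[i].state == 1 && !is_on) {
            on_since = e[i].time;
            is_on = 1;
        } else if (e[i].state == 0 && is_on) {
            total += e[i].time - on_since;
            is_on = 0;
        }
    }
    return total;
}
```

For the two entries shown above, the routine returns 109342768 - 109342759 = 9.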

Another example

4 -65.3477 109342735

4 -58.2367 109342819

Example: Key 4 has type “Value”. The second block contains a double precision value passed in during execution. The value can be saved and displayed with timing information, or sent to a separate graph. Timing is obtained the same as before, by subtracting the event times.

NCS Performance Measurement

• QQ was able to home in on specific blocks of code and measure them at a resolution fine enough for easy interpretation

Optimization Targets

• QQ analysis quickly identified two major targets within the code

• Synapses

• Message Passing

Synapses

• Synapses were by far the most common element of any NCS model, with the most memory usage

– Active only when an action potential was processed through the synapse

– Pass information between the nodes via message passing

Message Parsing Overhead

• Using QQ we were able to identify areas for improvement within NCS 3

• Many unneeded fields requiring better encoding of their destination

• Fixed number of messages pre-allocated, far more than needed by the program

– Implemented a shared pool, buffers allocated as needed

• Messages sent individually, processed multiple times

– Implemented a packet scheme: process packet once for send, once for receive

– Process messages only when used

Conclusions

• QQ allows profiling of nanoscale timing of code segments and memory usage analysis

• Fine-grained measurements of specific events

• Ability to measure memory at an object or event level with a small memory and performance footprint

• Simple and effective tool

Future Work

• New Opteron cluster

• BlueGene migration (how many processors?)

• Robotic integration

Q & A