Improving Throughput of Simultaneous Multithreading (SMT) Processors using
Application Signatures and Thread Priorities
Mitesh R. Meswani
University of Texas at El Paso (UTEP)
By Mitesh R. Meswani, 11/20/2008
Simultaneous Multithreading (SMT) Utilization
[Figure: occupancy of the FP, FX, and LSU execution units over six processor cycles, comparing Single-Threaded execution with SMT execution using two hardware threads. Legend: Thread-X executing, Thread-Y executing, no thread executing. In SMT mode, Thread X uses execution slots left idle by Thread Y, but sometimes waits for a shared resource to become free.]

• SMT hardware contexts share most of the processor resources
• Potential of 2x throughput with perfect resource sharing
• Throughput gains are limited by contention for shared resources
Research Question and Hypothesis
• SMT-performance Tunables:
– Enable or disable SMT mode
– Prioritize one hardware thread over the other
• Research Question: What are the optimal priority settings for best processor throughput?
• Hypothesis: Resource-usage hints gathered in Single-Threaded mode can guide the choice of priorities
Dissertation Contributions
1. Showed that prioritization of threads improves throughput: equal priorities (the default) are not best for nearly 47% of SPEC CPU2000/2006, STREAM, and lmbench benchmark co-schedules
2. Defined and captured application "signatures," which characterize an application's resource usage
3. Showed that a small set of signatures is present in real-world applications: 16 signatures are sufficient to represent 95.5% of the execution time of the SPEC CPU2006 (20) benchmarks, NAS NPB3.2 Serial (9) benchmarks, PETSc KSP (119), and PETSc Matrix (180) libraries
4. Developed a prediction methodology using microbenchmarks that represent signatures, and showed that the predictions have the potential to improve throughput: 87% of PETSc KSP co-schedules achieve better throughput with predicted priorities than with the default
Thread Priorities in IBM POWER5
• Six of the eight priorities are available to the operating system in normal mode of operation: 1, 2, 3, 4 (default), 5, and 6
• The difference between the two hardware threads' priorities controls how decode cycles are shared
Thread X Priority | Thread Y Priority | Priority Difference | Thread X Decode Cycles | Thread Y Decode Cycles
6           | 1           | 5 | 63/64 | 1/64
6           | 2           | 4 | 31/32 | 1/32
6           | 3           | 3 | 15/16 | 1/16
6           | 4           | 2 | 7/8   | 1/8
6           | 5           | 1 | 3/4   | 1/4
4 (default) | 4 (default) | 0 | 1/2   | 1/2
Signatures
1. Identify significant resources: floating-point unit (FPU), fixed-point unit (FXU), L2 unified cache, and L2 unified TLB
2. Capture resource usage with performance counters
3. Define utilization levels of the resources in Single-Threaded mode, forming a signature
– Ten utilization levels, L1 to L10, per resource
– Example signatures: L1L2L3L9, L9L6L7L8, L2L3L10L6, …
Work Flow
Step 1: Find signatures of real applications. Run each serial application in Single-Threaded mode with the chosen performance counter settings, periodically sampling the counters, and store the resulting signatures in a signature database.

Step 2: Create signature microbenchmarks for frequently appearing signatures and empirically find priority predictions. For each signature-microbenchmark pair (X, Y), run the pair under priorities (i, j) in SMT mode, store the CPI for all priority combinations, identify the best-case priorities for the pair, and store them in a prediction database.

Step 3: Execute application pairs using predicted priorities. For an application pair (A, B), read the signatures of A and B from the signature database. If dominating signatures are found, read the predicted priorities from the prediction database and run the pair with those priorities in SMT mode; otherwise, run the pair with equal priorities in SMT mode.
Details of Step 1
• Four groups of counters were measured
• Each group was measured in a separate run
• Counters were sampled at one-second intervals
• The difference in execution time across the four runs was negligible
• For 99% of samples, the difference between the number of instructions and run cycles was negligible

[Figure: alignment of one-second sample intervals (samples 0-21) across the four runs.]
Different Signatures are Present in Real Applications
[Figure: signature histogram (% of total cycles) for four SPEC CPU2006 benchmarks (429.mcf, 416.gamess, 444.namd, 462.libquantum) and two PETSc KSP library functions (cgs, gmres). Each application's bar is broken down by the signatures observed during its execution, e.g. L1L1L1L1, L1L2L1L1, L3L2L1L1, L1L1L9L5.]
Conclusions
1. Showed that equal priorities (the default) are not best for nearly 47% of the applications studied
2. Only 16 signatures are sufficient to represent 95.5% of the execution time of 20 SPEC CPU2006 benchmarks, 9 NAS NPB3.2 Serial benchmarks, 119 PETSc KSP, and 180 PETSc Matrix libraries
3. Priority predictions using signature microbenchmarks improve throughput over the default settings for 87% of the 15 PETSc KSP co-schedules
Applications with Multiple Signatures
Future Work and References
Future Work:
• Identify applications with multiple signatures
• Dynamic adaptation of priorities
• Detecting signatures on the fly
• Phase detection and prediction for a truly adaptive system

References:
• M. R. Meswani, P. J. Teller, and S. Arunangiri, "A Study of the Influence of the POWER5 Dynamic Resource Balancing Hardware on Optimal Hardware Thread Priorities," to appear in the Proceedings of the 2008 Live Virtual Constructive Conference, Jan 2009, El Paso, TX.
• M. R. Meswani and P. J. Teller, "Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000," Proceedings of the 2nd International Workshop on Operating Systems Interference in High Performance Applications, held in conjunction with the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006), sponsored by ACM and IEEE, September 2006, Seattle, WA.
Acknowledgements
• This work is supported by AHPCRC Grant W11NF-07-2-2007
• Thanks to Amir Simon (IBM) for his valuable assistance with fixing the firmware of the p550 machine
Questions?
EXTRA SLIDES
Simultaneous Multithreading (SMT)
[Figure: SMT pipeline showing per-thread and shared resources. Per-thread: program counters X and Y, instruction buffers X and Y, and write-back X and Y. Shared: instruction fetch, instruction cache, instruction TLB, decode, the FPU, FXU, and LSU execution units, data cache, and data TLB.]

SMT hardware contexts share most of the processor resources.
Methodology Overview - 1
1. Identify a significant subset of the shared resources
– Resources identified: L2 unified cache, L2 unified TLB, floating-point unit (FPU), and fixed-point unit (FXU)
2. Identify and validate performance counters
3. Define utilization levels of the resources in Single-Threaded mode, forming a signature
– Ten utilization levels, L1 to L10, per resource: L1 is 0%-10%, L2 is 11%-20%, …, L10 is 91%-100%
– A signature is the tuple of utilization levels (L1-L10) of the FPU, FXU, L2 cache, and L2 TLB
– Example: L1L2L3L9, L9L6L7L8, …
4. An application is said to have one dominating signature if that signature is associated with at least 80% of the application's execution time
Results – 2: Small Subset of Signatures are Sufficient to Represent Majority of the Execution Time of Applications
% of Total Cycles | Signature
16.3% | L1L3L1L1
13.2% | L1L1L1L1
12.0% | L1L2L1L1
 6.8% | L2L3L1L1
 5.6% | L3L1L3L1
 5.3% | L4L1L1L1
 5.3% | L2L1L1L1
 5.0% | L2L2L1L1
 4.8% | L3L1L1L1
 4.3% | L1L2L2L1
 3.9% | L2L2L2L1
 3.8% | L1L2L3L1
 3.5% | L1L1L2L1
 2.4% | L2L1L2L1
 1.9% | L5L1L1L1
 1.4% | L3L1L2L1
 4.4% | Others (19)
16 Signatures are Sufficient to Represent 95.6% of Execution Time of 20 SPEC CPU2006, 9 NAS NPB3.2 Serial, 119 PETSc KSP, and 180 PETSc Matrix Benchmarks
Results –Priority Predictions using Signature Benchmarks can Potentially Improve Throughput
Prediction | Thread X | Thread X Signature | Thread Y | Thread Y Signature | Best Case | Worst Case
6-5 | bicg  | L1L2L1L1 | bicg       | L1L2L1L1 | 6-6 | 3-6
4-6 | bicg  | L1L2L1L1 | lsqr       | L1L3L1L1 | 6-6 | 2-6
5-6 | bicg  | L1L2L1L1 | tcqmr      | L1L1L1L1 | 6-2 | 1-6
6-6 | lsqr  | L1L3L1L1 | lsqr       | L1L3L1L1 | 6-5 | 1-6
5-6 | lsqr  | L1L3L1L1 | tcqmr      | L1L1L1L1 | 6-2 | 1-6
6-5 | tcqmr | L1L1L1L1 | tcqmr      | L1L1L1L1 | 6-5 | 3-6
6-5 | bcgs  | L1L1L1L1 | bcgs       | L1L1L1L1 | 6-5 | 2-6
6-5 | bcgs  | L1L1L1L1 | bicg       | L1L2L1L1 | 6-5 | 2-6
6-5 | bcgs  | L1L1L1L1 | cgs        | L1L1L1L1 | 6-5 | 3-6
6-5 | bcgs  | L1L1L1L1 | chebychev  | L1L1L1L1 | 6-1 | 3-6
6-5 | bcgs  | L1L1L1L1 | cr         | L1L1L1L1 | 6-1 | 1-6
6-5 | bcgs  | L1L1L1L1 | gmres      | L1L1L1L1 | 6-1 | 2-6
6-5 | bcgs  | L1L1L1L1 | lsqr       | L1L3L1L1 | 6-5 | 1-6
6-5 | bcgs  | L1L1L1L1 | richardson | L1L1L1L1 | 6-1 | 3-6
6-5 | bcgs  | L1L1L1L1 | tcqmr      | L1L1L1L1 | 6-1 | 1-6
For 15 PETSc KSP co-schedules, the predicted settings:
• improved throughput over default for 87% of co-schedules,
• are the best for 33% of co-schedules, and
• are never the worst-case settings
Signatures in Applications
• PETSc linear solvers: identify signatures using performance counters
• Summary:
– Using a simulator, showed that intelligent settings of hardware thread priorities can enhance workload performance
– Critical microarchitecture resource-usage "signatures" can be used to determine "best" priorities
– Different signatures exist in real-world applications and have been shown to be useful in enhancing utilization and throughput
Signatures and Application Phases
[Figure: phase transitions across intervals of consecutive phases.]

• Application executions are composed of multiple phases
• For each phase in Single-Threaded mode, monitor the utilization of shared resources (the phase's signature)
• Resource utilization can be used to estimate the availability of resources for other threads
• Given the signatures of two threads, predict the thread priorities that maximize overall throughput
POWER5 Chip
• POWER5 chip: two identical cores, each core with two SMT threads, 64KB L1 ICache, 32KB L1 DCache, shared unified 1.92MB L2 cache, off-chip 36MB L3 cache, 128-entry L1 ITLB, 128-entry L1 DTLB, and 1024-entry unified L2 TLB
FPU and FXU Benchmark
• Each benchmark runs for 100 seconds in Single-Threaded mode
• Data dependencies and no-ops are introduced to lower utilization levels
• Utilization achieved:
– FPU: 10% to 99%
– FXU: 10% to 70%

FPU Benchmark for Maximum Utilization (99%):
  Loop: fadd R0,R0,R0
        ...
        fadd R31,R31,R31
        (above block copied four times)
        count++
        branch to Loop if count < max

FXU Benchmark for Maximum Utilization (70%):
  Loop: addi R0,R0,0
        ...
        addi R31,R31,31
        (above block copied six times)
        count++
        branch to Loop if count < max
L2 Cache and L2 TLB Benchmark
• Each benchmark runs for 100 seconds in Single-Threaded mode
• Repeated accesses to an element are introduced in the while loop to reduce utilization levels
• Utilization was achieved in the range of 10% to 99%

L2 Cache Benchmark for Maximum Utilization (99%):
1. Allocate an array bigger than the L2 cache
2. The first element of cache line 1 points to the first element of line 4, which points to the first element of line 7, and so on; the stride is 3 cache lines
3. The main body implements pointer chasing:

   for (j = 0; j < 1000000; j++) {
       elem = (int *)arr[0];     /* initialize to point to first element */
       while (elem != NULL)      /* continue while not last line */
           elem = (int *)*elem;  /* load address of line + stride */
   }

L2 TLB Benchmark for Maximum Utilization (99%):
1. Allocate an array bigger than the number of pages mapped by the TLB entries
2. The first element of each page points to the first element of the next page; the stride is one page
3. The main body implements pointer chasing:

   for (j = 0; j < 400000; j++) {
       elem = (int *)arr[0];     /* initialize to point to first element */
       while (elem != NULL)      /* continue while not last page */
           elem = (int *)*elem;  /* load address of next page */
   }
Multi-resource Signature Benchmark
• The loop body varies the number of FPU and FXU operations and the stride of array accesses to achieve the desired signature
• Each benchmark runs for 100 seconds in Single-Threaded mode
• A total of 12 of the 16 possible signatures were developed:
– Signatures developed: LLLL, LLHL, LLHH, LHLL, LHHL, LHHH, HLLL, HLHL, HLHH, HHLL, HHHL, HHHH
– Signatures with low utilization of the L2 cache and high utilization of the TLB were not developed, namely LLLH, LHLH, HLLH, HHLH

LLHH Benchmark:
1. Allocate an array bigger than the number of pages mapped by the TLB entries
2. The first element of each page points to the first element of the next page; the stride is one page
3. The main body implements pointer chasing plus a few floating-point and integer operations:

   for (j = 0; j < 390000; j++) {
       elem = (int *)arr[0];         /* initialize to point to first element */
       while (elem != NULL) {        /* continue while not last page */
           elem = (int *)*elem;      /* load address of next page */
           8 floating-point additions;
           8 integer additions;
       }
   }

HHLL Benchmark:
1. Allocate an array bigger than the L2 cache
2. The first element of each line points to the first element of the next line; the stride is one cache line
3. The main body consists of floating-point and integer operations and pointer chasing:

   for (j = 0; j < 9000; j++) {
       elem = (int *)arr[0];         /* initialize to point to first element */
       while (elem != NULL) {        /* continue while not last line */
           168 floating-point additions;
           168 integer additions;
           elem = (int *)*elem;      /* load address of next line */
       }
   }