Improving Throughput of Simultaneous Multithreading (SMT) Processors using
Application Signatures and Thread Priorities
Mitesh R. Meswani
University of Texas at El Paso (UTEP)
By Mitesh R. Meswani, 11/20/2008
Simultaneous Multithreading (SMT) Utilization
[Figure: occupancy of the FP, FX, and LSU execution units over six processor cycles, comparing Single-Threaded execution with SMT execution using two hardware threads. Legend: Thread-X executing, Thread-Y executing, no thread executing. In SMT mode, Thread X uses execution slots left idle by Thread Y, but sometimes waits for a shared resource to become free.]

• SMT hardware contexts share most of the processor resources
• Potential of 2x throughput with perfect resource sharing
• Throughput gains are limited by contention for shared resources
Research Question and Hypothesis
• SMT-performance Tunables:
– Enable or disable SMT mode
– Prioritize one hardware thread over the other
• Research Question: What are the optimal priority settings for best processor throughput?
• Hypothesis: Resource-usage hints gathered in Single-Threaded mode can guide the choice of priorities
Dissertation Contributions
1. Showed that prioritization of threads improves throughput: equal priorities (the default) are not best for nearly 47% of SPEC CPU2000/2006, STREAM, and lmbench benchmark co-schedules
2. Defined and captured application "signatures," which characterize an application's resource usage
3. Showed that a small set of signatures is present in real-world applications: 16 signatures are sufficient to represent 95.5% of the execution time of the SPEC CPU2006 (20) benchmarks, NAS NPB3.2 Serial (9) benchmarks, PETSc KSP (119), and PETSc Matrix (180) libraries
4. Developed a prediction methodology using microbenchmarks that represent signatures, and showed that the predictions have the potential to improve throughput: 87% of PETSc KSP co-schedules achieve better throughput with predicted priorities than with the default
Thread Priorities in IBM POWER5
• Six of the eight priorities are available to the operating system in normal mode of operation: 1, 2, 3, 4 (default), 5, and 6
• The difference between the two hardware threads' priorities controls how decode cycles are shared
Thread X Priority | Thread Y Priority | Priority Difference | Thread X Decode Cycles | Thread Y Decode Cycles
6           | 1           | 5 | 63/64 | 1/64
6           | 2           | 4 | 31/32 | 1/32
6           | 3           | 3 | 15/16 | 1/16
6           | 4           | 2 | 7/8   | 1/8
6           | 5           | 1 | 3/4   | 1/4
4 (default) | 4 (default) | 0 | 1/2   | 1/2
Signatures
1. Identify significant resources: floating-point unit (FPU), fixed-point unit (FXU), L2 unified cache, and L2 unified TLB
2. Capture resource usage with performance counters
3. Define utilization levels of the resources in Single-Threaded mode, forming a signature
– Ten utilization levels, L1 to L10, per resource
– Example signatures: L1L2L3L9, L9L6L7L8, L2L3L10L6, …
Work Flow
Step 1: Find signatures of real applications. Run each serial application in Single-Threaded mode with the chosen performance counter settings, periodically sampling the counters, and store the resulting signatures in a signature database.

Step 2: Create signature microbenchmarks for frequently appearing signatures and empirically find priority predictions. For each signature-microbenchmark pair (X, Y), run the pair under priorities (i, j) in SMT mode, store the CPI for all priority combinations, identify the best-case priorities for the pair, and store them in a prediction database.

Step 3: Execute application pairs using predicted priorities. For an application pair (A, B), read the signatures of A and B from the signature database. If dominating signatures are found, read the predicted priorities from the prediction database and run the pair with those priorities in SMT mode; otherwise, run the pair with equal priorities in SMT mode.
Details of Step 1
• Four groups of counters were measured
• Each group was measured in a separate run
• Counters were sampled at one-second intervals
• The difference in execution time across the four runs was negligible
• For 99% of samples, the difference between the number of instructions and run cycles was negligible

[Figure: alignment of one-second sample intervals (samples 0-21) across the four runs.]
Different Signatures are Present in Real Applications
[Figure: signature histogram (% of total cycles) for four SPEC CPU2006 benchmarks (429.mcf, 416.gamess, 444.namd, 462.libquantum) and two PETSc KSP library functions (cgs, gmres). Each application's bar is broken down by the signatures observed during its execution, e.g. L1L1L1L1, L1L2L1L1, L3L2L1L1, L1L1L9L5.]
Conclusions
1. Showed that equal priorities (the default) are not best for nearly 47% of the applications studied
2. Only 16 signatures are sufficient to represent 95.5% of the execution time of 20 SPEC CPU2006 benchmarks, 9 NAS NPB3.2 Serial benchmarks, 119 PETSc KSP, and 180 PETSc Matrix libraries
3. Priority predictions using signature microbenchmarks improve throughput over the default settings for 87% of the 15 PETSc KSP co-schedules
Applications with Multiple Signatures
Future Work and References
Future Work:
• Identify applications with multiple signatures
• Dynamic adaptation of priorities
• Detecting signatures on the fly
• Phase detection and prediction for a truly adaptive system

References:
• M. R. Meswani, P. J. Teller, and S. Arunangiri, "A Study of the Influence of the POWER5 Dynamic Resource Balancing Hardware on Optimal Hardware Thread Priorities," to appear in the Proceedings of the 2008 Live Virtual Constructive Conference, Jan 2009, El Paso, TX.
• M. R. Meswani and P. J. Teller, "Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000," Proceedings of the 2nd International Workshop on Operating Systems Interference in High Performance Applications, held in conjunction with the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006), sponsored by ACM and IEEE, September 2006, Seattle, WA.
Acknowledgements
• This work is supported by AHPCRC Grant W11NF-07-2-2007
• Thanks to Amir Simon (IBM) for his valuable assistance with fixing the firmware of the p550 machine
Questions?
EXTRA SLIDES
Simultaneous Multithreading (SMT)
[Figure: SMT pipeline showing per-thread and shared resources. Per-thread: program counters X and Y, instruction buffers X and Y, and write-back X and Y. Shared: instruction fetch, instruction cache, instruction TLB, decode, the FPU, FXU, and LSU execution units, data cache, and data TLB.]

SMT hardware contexts share most of the processor resources.
Methodology Overview - 1
1. Identify a significant subset of the shared resources
– Resources identified: L2 unified cache, L2 unified TLB, floating-point unit (FPU), and fixed-point unit (FXU)
2. Identify and validate performance counters
3. Define utilization levels of the resources in Single-Threaded mode, forming a signature
– Ten utilization levels, L1 to L10, per resource: L1 is 0%-10%, L2 is 11%-20%, …, L10 is 91%-100%
– A signature is the tuple of utilization levels (L1-L10) of the FPU, FXU, L2 cache, and L2 TLB
– Example: L1L2L3L9, L9L6L7L8, …
4. An application is said to have one dominating signature if that signature is associated with at least 80% of the application's execution time
Results – 2: Small Subset of Signatures are Sufficient to Represent Majority of the Execution Time of Applications
% of Total Cycles | Signature
16.3% | L1L3L1L1
13.2% | L1L1L1L1
12.0% | L1L2L1L1
 6.8% | L2L3L1L1
 5.6% | L3L1L3L1
 5.3% | L4L1L1L1
 5.3% | L2L1L1L1
 5.0% | L2L2L1L1
 4.8% | L3L1L1L1
 4.3% | L1L2L2L1
 3.9% | L2L2L2L1
 3.8% | L1L2L3L1
 3.5% | L1L1L2L1
 2.4% | L2L1L2L1
 1.9% | L5L1L1L1
 1.4% | L3L1L2L1
 4.4% | Others (19)
16 Signatures are Sufficient to Represent 95.6% of Execution Time of 20 SPEC CPU2006, 9 NAS NPB3.2 Serial, 119 PETSc KSP, and 180 PETSc Matrix Benchmarks
Results –Priority Predictions using Signature Benchmarks can Potentially Improve Throughput
Prediction | Thread X | Thread X Signature | Thread Y | Thread Y Signature | Best Case | Worst Case
6-5 | bicg  | L1L2L1L1 | bicg       | L1L2L1L1 | 6-6 | 3-6
4-6 | bicg  | L1L2L1L1 | lsqr       | L1L3L1L1 | 6-6 | 2-6
5-6 | bicg  | L1L2L1L1 | tcqmr      | L1L1L1L1 | 6-2 | 1-6
6-6 | lsqr  | L1L3L1L1 | lsqr       | L1L3L1L1 | 6-5 | 1-6
5-6 | lsqr  | L1L3L1L1 | tcqmr      | L1L1L1L1 | 6-2 | 1-6
6-5 | tcqmr | L1L1L1L1 | tcqmr      | L1L1L1L1 | 6-5 | 3-6
6-5 | bcgs  | L1L1L1L1 | bcgs       | L1L1L1L1 | 6-5 | 2-6
6-5 | bcgs  | L1L1L1L1 | bicg       | L1L2L1L1 | 6-5 | 2-6
6-5 | bcgs  | L1L1L1L1 | cgs        | L1L1L1L1 | 6-5 | 3-6
6-5 | bcgs  | L1L1L1L1 | chebychev  | L1L1L1L1 | 6-1 | 3-6
6-5 | bcgs  | L1L1L1L1 | cr         | L1L1L1L1 | 6-1 | 1-6
6-5 | bcgs  | L1L1L1L1 | gmres      | L1L1L1L1 | 6-1 | 2-6
6-5 | bcgs  | L1L1L1L1 | lsqr       | L1L3L1L1 | 6-5 | 1-6
6-5 | bcgs  | L1L1L1L1 | richardson | L1L1L1L1 | 6-1 | 3-6
6-5 | bcgs  | L1L1L1L1 | tcqmr      | L1L1L1L1 | 6-1 | 1-6
For 15 PETSc KSP co-schedules, the predicted settings:
• improved throughput over default for 87% of co-schedules,
• are the best for 33% of co-schedules, and
• are never the worst-case settings
Signatures in Applications
• PETSc linear solvers: identify signatures using performance counters
• Summary:
– Using a simulator, showed that intelligent settings of hardware thread priorities can enhance workload performance
– Critical microarchitecture resource-usage "signatures" can be used to determine "best" priorities
– Different signatures exist in real-world applications and have been shown to be useful in enhancing utilization and throughput
Signatures and Application Phases
[Figure: phase transitions across intervals of consecutive phases.]

• Application executions are composed of multiple phases
• For each phase in Single-Threaded mode, monitor the utilization of shared resources (the phase's signature)
• Resource utilization can be used to estimate the availability of resources for other threads
• Given the signatures of two threads, predict the thread priorities that maximize overall throughput
POWER5 Chip
• POWER5 chip: two identical cores, each core with two SMT threads, 64KB L1 ICache, 32KB L1 DCache, shared unified 1.92MB L2 cache, off-chip 36MB L3 cache, 128-entry L1 ITLB, 128-entry L1 DTLB, and 1024-entry unified L2 TLB
FPU and FXU Benchmark
• Each benchmark runs for 100 seconds in Single-Threaded mode
• Data dependencies and no-ops are introduced to lower utilization levels
• Utilization achieved:
– FPU: 10% to 99%
– FXU: 10% to 70%

FPU Benchmark for Maximum Utilization (99%):
  Loop: fadd R0,R0,R0
        ...
        fadd R31,R31,R31
        (above block copied four times)
        count++
        branch to Loop if count < max

FXU Benchmark for Maximum Utilization (70%):
  Loop: addi R0,R0,0
        ...
        addi R31,R31,31
        (above block copied six times)
        count++
        branch to Loop if count < max
L2 Cache and L2 TLB Benchmark
• Each benchmark runs for 100 seconds in Single-Threaded mode
• Repeated accesses to an element are introduced in the while loop to reduce utilization levels
• Utilization was achieved in the range of 10% to 99%

L2 Cache Benchmark for Maximum Utilization (99%):
1. Allocate an array bigger than the L2 cache
2. The first element of cache line 1 points to the first element of line 4, which points to the first element of line 7, and so on; the stride is 3 cache lines
3. The main body implements pointer chasing:

   for (j = 0; j < 1000000; j++) {
       elem = (int *)arr[0];     /* initialize to point to first element */
       while (elem != NULL)      /* continue while not last line */
           elem = (int *)*elem;  /* load address of line + stride */
   }

L2 TLB Benchmark for Maximum Utilization (99%):
1. Allocate an array bigger than the number of pages mapped by the TLB entries
2. The first element of each page points to the first element of the next page; the stride is one page
3. The main body implements pointer chasing:

   for (j = 0; j < 400000; j++) {
       elem = (int *)arr[0];     /* initialize to point to first element */
       while (elem != NULL)      /* continue while not last page */
           elem = (int *)*elem;  /* load address of next page */
   }
Multi-resource Signature Benchmark
• The loop body varies the number of FPU and FXU operations and the stride of array accesses to achieve the desired signature
• Each benchmark runs for 100 seconds in Single-Threaded mode
• A total of 12 of the 16 possible signatures were developed:
– Signatures developed: LLLL, LLHL, LLHH, LHLL, LHHL, LHHH, HLLL, HLHL, HLHH, HHLL, HHHL, HHHH
– Signatures with low utilization of the L2 cache and high utilization of the TLB were not developed, namely LLLH, LHLH, HLLH, HHLH

LLHH Benchmark:
1. Allocate an array bigger than the number of pages mapped by the TLB entries
2. The first element of each page points to the first element of the next page; the stride is one page
3. The main body implements pointer chasing plus a few floating-point and integer operations:

   for (j = 0; j < 390000; j++) {
       elem = (int *)arr[0];         /* initialize to point to first element */
       while (elem != NULL) {        /* continue while not last page */
           elem = (int *)*elem;      /* load address of next page */
           8 floating-point additions;
           8 integer additions;
       }
   }

HHLL Benchmark:
1. Allocate an array bigger than the L2 cache
2. The first element of each line points to the first element of the next line; the stride is one cache line
3. The main body consists of floating-point and integer operations and pointer chasing:

   for (j = 0; j < 9000; j++) {
       elem = (int *)arr[0];         /* initialize to point to first element */
       while (elem != NULL) {        /* continue while not last line */
           168 floating-point additions;
           168 integer additions;
           elem = (int *)*elem;      /* load address of next line */
       }
   }