University of Michigan, Electrical Engineering and Computer Science
Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems
Hyoun Kyu Cho and Scott Mahlke
University of Michigan, Ann Arbor
December 2, 2012
Critical Path
• The longest path between the source and the sink in a directed acyclic graph (DAG)
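The definition above can be made concrete with a small C sketch. It is only an illustration under stated assumptions, not the authors' tool: the `Edge` type and the `critical_path_length` function are hypothetical names, node 0 is taken as the source, node `n - 1` as the sink, and nodes are assumed to be numbered in topological order (every edge goes from a lower to a higher index).

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical edge type: a dependency from node src to node dst
 * that takes `weight` time units. */
typedef struct {
    int src;
    int dst;
    long weight;
} Edge;

/* Returns the length of the longest path from node 0 (source) to node
 * n - 1 (sink), or -1 if the sink is unreachable or allocation fails.
 * Assumes non-negative weights and src < dst for every edge, so visiting
 * nodes in index order is a valid topological order. */
long critical_path_length(int n, const Edge *edges, size_t m)
{
    assert(n > 0);
    long *dist = malloc((size_t)n * sizeof *dist);
    if (dist == NULL)
        return -1;

    dist[0] = 0;
    for (int v = 1; v < n; v++)
        dist[v] = LONG_MIN;              /* LONG_MIN marks "not reached yet" */

    for (int u = 0; u < n; u++) {
        if (dist[u] == LONG_MIN)
            continue;                    /* no path from the source to u */
        for (size_t e = 0; e < m; e++) {
            if (edges[e].src != u)
                continue;
            assert(edges[e].dst > u && edges[e].dst < n);
            long candidate = dist[u] + edges[e].weight;
            if (candidate > dist[edges[e].dst])
                dist[edges[e].dst] = candidate;   /* keep the longest path */
        }
    }

    long result = (dist[n - 1] == LONG_MIN) ? -1 : dist[n - 1];
    free(dist);
    return result;
}
```

For a four-node graph with edges 0→1 (3), 0→2 (2), 1→3 (4), and 2→3 (10), the two source-to-sink paths have lengths 7 and 12, so the function returns 12.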
Critical Path
[Saidi`08]
[Figure: critical path example; numeric labels not recoverable]
Critical Path for Multithreaded Programs
[Figure: critical path graphs for two synchronization primitives.
(a) Mutex Lock: threads T1 and T2; events Call, StartLock, EndLock, Unlock.
(b) Barrier: threads T1, T2, and T3; events Call, ArBarrier, LvBarrier.]
[Hollingsworth`98]
Scalability of Multithreaded Programs
Some benchmarks do not scale well!
CPU Time Wasted on Synchronizations
Synchronization is a major bottleneck!
Accelerating Critical Path
• ACS [Suleman et al. ASPLOS `09]
  – Critical sections
• Voltage Boosting [Dreslinski `11]
  – Transactional bottlenecks
• Booster [Miller et al. HPCA `12]
  – Alleviate performance variation
  – Reactive acceleration for barriers
Challenges and Opportunities of NTC
• Poor single-thread performance
• Very sensitive to process variation
  – Running every core at the speed of the slowest one leads to severe performance loss
  – Likely to have performance heterogeneity
• Potential for bigger frequency boosting
Objectives
• Systematic way of identifying critical paths
• Dealing with performance variation
• Flexible control of core boosting
System Architecture
[Figure: system architecture.
Offline: Target Program, Compilation, Intermediate Representation, Parallelism Analysis, Monitor Instrumentation, Instrumented Executable.
Online: Monitoring Logic (Observe, Adjust Priority), Priority, Weighted Probabilistic Priority Scheduler (Schedule).]
Lottery Scheduling
• Each thread holds a number of tickets
• The scheduler selects the fast-mode thread by picking a ticket
• Efficient implementation of proportional-share resource management
• Responsive, flexible control over relative execution rates
[Waldspurger`94]
Example: tickets held = 10, 2, 5, 1, 2; total = 20; random [0 .. 19] = 15
  ∑ = 10: ∑ > 15? no
  ∑ = 12: ∑ > 15? no
  ∑ = 17: ∑ > 15? yes (the third ticket holder is selected)
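The ticket walk on this slide (ticket counts 10, 2, 5, 1, 2 and a winning draw of 15) can be sketched in C as follows. This is a minimal illustration of the running-sum selection, not the authors' scheduler; `lottery_pick` and `lottery_draw` are hypothetical names, and `rand()` stands in for whatever random source a real scheduler would use.

```c
#include <assert.h>
#include <stdlib.h>

/* Walks the ticket counts and returns the index of the first thread whose
 * running ticket sum exceeds the winning number. Returns -1 if the winning
 * number is not smaller than the total number of tickets. */
int lottery_pick(const unsigned *tickets, int n, unsigned winning)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++) {
        sum += tickets[i];
        if (sum > winning)
            return i;
    }
    return -1;
}

/* Draws a winning number in [0 .. total - 1] and picks the winner.
 * rand() % total is slightly biased; acceptable for a sketch only.
 * Returns -1 if nobody holds any tickets. */
int lottery_draw(const unsigned *tickets, int n)
{
    unsigned total = 0;
    for (int i = 0; i < n; i++)
        total += tickets[i];
    if (total == 0)
        return -1;
    return lottery_pick(tickets, n, (unsigned)rand() % total);
}
```

With the slide's numbers, `lottery_pick` sees running sums of 10, 12, and 17, so a draw of 15 selects index 2 (the third thread). Over many draws each thread is chosen in proportion to its ticket count, which is what gives the scheduler proportional control over who runs in fast mode.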
Progress Monitoring
• For data-parallel threads
• Slower threads are more likely to be on the critical path
• Divide the task into multiple smaller chunks and instrument monitoring code
• Monitoring code reduces the number of tickets
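One way to picture the ticket reduction is the sketch below. It is an assumption-laden illustration, not the authors' implementation: the `ThreadPriority` type, the starting count `INITIAL_TICKETS`, the floor `MIN_TICKETS`, and all function names are hypothetical. Each completed chunk halves the thread's tickets, so a thread that has made less progress keeps more tickets and is more likely to win the fast-mode lottery.

```c
#include <assert.h>
#include <stddef.h>

#define INITIAL_TICKETS 1024u   /* assumed starting ticket count */
#define MIN_TICKETS     1u      /* assumed floor so a thread never reaches zero */

/* Hypothetical per-thread priority record. */
typedef struct {
    unsigned tickets;
} ThreadPriority;

/* Called when a thread enters a new data-parallel phase. */
void reset_tickets(ThreadPriority *p)
{
    assert(p != NULL);
    p->tickets = INITIAL_TICKETS;
}

/* Called by the monitoring code each time a chunk of work completes. */
void halve_tickets(ThreadPriority *p)
{
    assert(p != NULL);
    unsigned half = p->tickets / 2;
    p->tickets = (half < MIN_TICKETS) ? MIN_TICKETS : half;
}

/* Returns the ticket count of a thread that has finished
 * `chunks_done` chunks since the phase started. */
unsigned tickets_after_progress(unsigned chunks_done)
{
    ThreadPriority p;
    reset_tickets(&p);
    for (unsigned c = 0; c < chunks_done; c++)
        halve_tickets(&p);
    return p.tickets;
}
```

Under these assumed constants, a thread that has finished two chunks holds 256 tickets while one that has finished three holds 128, so the slower thread is twice as likely to be picked.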
Example of Progress Monitoring
…
pthread_barrier_wait(barrier);
long PROGRESS_GRANULE = (k2 - k1) / NUM_STEPS;
/* avoid a zero granule (modulo by zero) when the range is shorter than NUM_STEPS */
if ( PROGRESS_GRANULE == 0 )
    PROGRESS_GRANULE = 1;
for ( i = k1 ; i < k2 ; i++ ) {
    /* Loop body */
    float x_cost = dist(points->p[i], points->p[x], points->dim) * points->p[i].weight;
    float current_cost = points->p[i].cost;
    if ( x_cost < current_cost ) {
        switch_membership[i] = 1;
        cost_of_opening_x += x_cost - current_cost;
    } else {
        int assign = points->p[i].assign;
        lower[center_table[assign]] += current_cost - x_cost;
    }
    /* Monitoring code: report progress once per granule */
    if ( (i - k1) % PROGRESS_GRANULE == 0 )
        halve_priority_tickets();
}
pthread_barrier_wait(barrier);
…
Priority Delegation
• A thread holding a mutex is likely to be on the critical path
  – Increase its tickets when it acquires a mutex
• It is more likely to be on the critical path if other threads are waiting
  – Temporarily transfer each waiting thread’s tickets to the thread holding the mutex
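The two rules above can be sketched as simple ticket bookkeeping. This is a hypothetical illustration, not the authors' runtime: the `Tickets` type, the `LOCK_BONUS` constant, and the function names are all assumptions, and the calls would be made from instrumented lock, wait, and unlock sites.

```c
#include <assert.h>
#include <stddef.h>

#define LOCK_BONUS 16u   /* assumed ticket boost granted while holding a mutex */

/* Hypothetical per-thread ticket record. */
typedef struct {
    unsigned own;        /* tickets the thread holds itself */
    unsigned borrowed;   /* tickets delegated by threads waiting on it */
} Tickets;

/* Tickets that count in the lottery for this thread. */
unsigned effective_tickets(const Tickets *t)
{
    return t->own + t->borrowed;
}

/* Rule 1: boost a thread when it acquires a mutex. */
void on_acquire(Tickets *holder)
{
    holder->own += LOCK_BONUS;
}

/* Rule 2: a thread that blocks on the mutex lends all of its own tickets
 * to the holder. Returns the amount lent so it can be given back later. */
unsigned on_wait(Tickets *holder, Tickets *waiter)
{
    unsigned lent = waiter->own;
    waiter->own = 0;
    holder->borrowed += lent;
    return lent;
}

/* Gives lent tickets back to a waiter once the holder is done. */
void return_tickets(Tickets *holder, Tickets *waiter, unsigned lent)
{
    assert(holder->borrowed >= lent);
    holder->borrowed -= lent;
    waiter->own += lent;
}

/* Removes the acquisition boost when the holder releases the mutex. */
void on_release(Tickets *holder)
{
    assert(holder->own >= LOCK_BONUS);
    holder->own -= LOCK_BONUS;
}
```

For example, a holder with 10 tickets rises to 26 on acquisition and to 32 once a waiter with 6 tickets blocks on the mutex; after `return_tickets` and `on_release`, both threads are back at 10 and 6. The transfer is temporary by construction because the lent amount is tracked separately from the holder's own tickets.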
Performance Evaluation
• Post-processing traces
  – Generated on a 32-core machine
    • Four 8-core Intel Xeon X7560
    • 24MB L3 cache per chip
    • 32GB total memory
  – Augmented progress time indication
    • 1.5x, 2x, 5x, 10x acceleration for 1 fast-mode core
    • Varying scheduling quantum from 1us to 1ms
Speedup for Streamcluster
[Figure: speedup for streamcluster; chart labels: H/W, OS, Usermode]
Current Status
[Figure: system architecture (Target Program, Compilation, Intermediate Representation, Parallelism Analysis, Monitor Instrumentation, Instrumented Executable, Monitoring Logic, Weighted Probabilistic Priority Scheduler), shown together with Cores in Normal and Turbo modes.]
Conclusion & Future Work
• Introduces a S/W framework that improves multithreaded programs’ performance using core boosting
• Combines static analysis, dynamic monitoring, and probabilistic priority scheduling to predict critical paths
• Shows 5% to 27% performance improvement for streamcluster
• Better modeling of the tradeoff between performance and energy
• Predicting critical paths for other types of parallelism